Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

kimwalisch · 2022-05-09T09:39:24Z

primesieve::iterator's performance depends heavily on the fillNextPrimes() method from PrimeGenerator.cpp. For x64 we have a vectorized AVX512 algorithm that is pretty optimal for this task. Once other CPU architectures (e.g. arm64) support 512-bit vector instructions like AVX512 we should port our AVX512 algorithm to these CPU architectures.

ARM has recently added (2021) the Scalable Vector Extension (SVE) to its CPUs. SVE is supposed to be a portable vector instruction set that works with different vector instructions widths. However for vectorizing our fillNextPrimes() method we need at least 512-bit vector instructions.

The text was updated successfully, but these errors were encountered:

kimwalisch · 2024-02-25T19:42:35Z

Primesieve's AVX512 fillNextPrimes() algorithm is close to optimal for converting 1-bits from the sieve array into the corresponding bit indexes/positions (and next into prime numbers). The algorithm mainly relies on the VPCOMPRESSB and VPERMB instructions from AVX512.

According to ChatGPT the new ARM SVE instruction set has those instructions as well.

Yes, ARM Scalable Vector Extension (SVE) does offer a similar functionality to Intel AVX-512's VPCOMPRESSB instruction. In ARM SVE, the svcompress_b instruction serves a similar purpose.
Yes, ARM Scalable Vector Extension (SVE) provides a similar functionality to Intel's AVX-512 VPERMB instruction. In SVE, you can achieve similar byte-wise permute operations using the sveperm instruction family.

This means that it should be relatively easy to port Primesieve's AVX512 fillNextPrimes() algorithm to ARM SVE. But there is one big drawback, in ARM SVE the vector length is dynamic which adds another layer of complexity for implementing this algorithm.

kimwalisch · 2024-02-26T17:04:52Z

According to ARM’s Scalable Vector Extensions: A Critical Look at SVE2 For Integer Workloads, as of 2024 there are no ARM CPUs out yet that support SVE or SV2. And SVE2 will likely be limited to 128-bits for the next few years, on top of that the dynamic vector length of ARM SVE is a pain in the ass in many cases (including its use in primesieve). Maybe these issues explain why there is no wide adoption of ARM SVE yet...

It is best to wait a few years until ARM CPUs ship a vector extension of at least 256-bits (ideally we want 512-bits like in AVX512.). Ideally we want a fixed vector length instruction set, similar to ARM NEON or AVX512.

anadon mentioned this issue Dec 6, 2022

Sister project technology transfer #131

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

kimwalisch commented May 9, 2022

kimwalisch commented Feb 25, 2024

kimwalisch commented Feb 26, 2024

Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

Add vectorized fillNextPrimes() algorithm for other CPU archtectures (e.g. arm64) #114

Comments

kimwalisch commented May 9, 2022

kimwalisch commented Feb 25, 2024

kimwalisch commented Feb 26, 2024