AVX-512: When bits really matter

For the majority of workloads, messing with assembly instructions isn't worth it. The added complexity and code obfuscation usually outweigh the relatively modest gains, mainly because compilers have gotten pretty fantastic at code generation and processors are so fast that it's hard to get a significant speedup by hand-tuning a small section of code. That changes when you can bring SIMD to bear on a problem like quickly decoding many bitsets: Intel's fancy AVX-512 SIMD instructions can provide significant performance gains with relatively little custom assembly.

Like many software engineers, [Daniel Lemire] had many bitsets (a range of integers/enums encoded in a binary number, each bit corresponding to a different integer or enumeration). Rather than checking whether a specific flag is present (a bitwise AND), [Daniel] wanted to extract all the flags set in a given bitset. The easiest way would be to cycle through them all as follows:

while (word != 0) {
  result[i] = trailingzeroes(word); // index of the lowest set bit
  word = word & (word - 1);         // clear the lowest set bit
  i++;
}

The naive version of this loop is very likely to suffer from bad branch prediction, and you or the compiler will speed it up by unrolling it. However, the AVX-512 instruction set on the latest Intel processors contains a handy instruction just for this sort of thing: vpcompressd, which Intel exposes through the handy and memorable C/C++ function _mm512_mask_compressstoreu_epi32.

The function generates an array of integers, and you can use the famous popcnt instruction to get the number of ones. Some early benchmarks show that the AVX-512 version uses 45% fewer cycles. You might be wondering if the CPU slows down when 512-bit wide registers are used? Yes. But even with downclocking, the SIMD version is still 33% faster. The code is available on GitHub if you want to try it yourself.
