Notes on computers, programming and all the stuff

Monday, 13 April 2015

Speeding up bit-parallel population count

Nearly 50% faster than the naive version for large data sets. Discovered by accident. :)
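For context, here is a minimal sketch of the textbook bit-parallel (SWAR) population count for 64-bit words. The teaser above does not say what the naive version or the speed-up looks like, so this only illustrates the general technique; the names popcount64 and popcount_buffer are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Textbook bit-parallel popcount: sum bits in 2-, 4- and 8-bit fields,
// then add all byte sums with a single multiplication.
uint64_t popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           // 2-bit sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // 8-bit sums
    return (x * 0x0101010101010101ULL) >> 56;                             // total lands in the top byte
}

// Hypothetical helper: total number of set bits in a buffer of 64-bit words.
uint64_t popcount_buffer(const uint64_t* data, std::size_t n) {
    uint64_t total = 0;
    for (std::size_t i = 0; i < n; ++i)
        total += popcount64(data[i]);
    return total;
}

// Small self-check showing how the helpers are meant to be called.
void demo_popcount() {
    const uint64_t words[] = {0x1ULL, 0xFFULL, ~0ULL};
    assert(popcount_buffer(words, 3) == 1 + 8 + 64);
}
```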

Thursday, 9 April 2015

GitHub repositories

I've put the source code for two of my articles on GitHub.

The repositories contain the original code (that is, C99, 32-bit, for GCC with inline assembly) and also new programs in C++11 using intrinsics, tested in a 64-bit environment.

BTW the article about popcount has gained popularity, and I hope another crazy idea about hacking MPSADBW will spread all over the world.

SIMD-ized searching in unique constant dictionary

The problem: there is an ordered dictionary containing only unique keys. The dictionary is read-only, and keys are 32-bit (SSE) or 64-bit (AVX2). Read more ...

Sunday, 22 March 2015

SIMD: detecting a bit pattern

The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k >= 0). The type field is located in the lower 32 bits.

The procedure processes two "types", one denoted with code 3 and another with code 5. When all items are of type 3, we can use a fast AVX2 path; if there are any items of type 5, we have to call an additional function (a virtual method, to be precise). Read more ...
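The check itself can be modelled in scalar code. The sketch below is not the method from the article: it assumes, purely for illustration, a 3-bit type field in the lowest bits (TYPE_MASK), and the function name all_fast_type is made up. It accumulates differences with and/xor/or, which is the same shape a SIMD version would take.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch only: the type field is the lowest 3 bits.
constexpr uint64_t TYPE_MASK = 0x7;
constexpr uint64_t TYPE_FAST = 3;  // items handled by the fast path
constexpr uint64_t TYPE_SLOW = 5;  // items that need the extra call

// Returns true when every item carries type 3.
// Branch-free inside the loop: any differing type leaves a non-zero bit in diff.
bool all_fast_type(const uint64_t* items, std::size_t n) {
    uint64_t diff = 0;
    for (std::size_t i = 0; i < n; ++i)
        diff |= (items[i] & TYPE_MASK) ^ TYPE_FAST;
    return diff == 0;
}

// Small self-check showing how the function is meant to be called.
void demo_type_check() {
    const uint64_t only_fast[] = {(0xABCDULL << 32) | TYPE_FAST, (0x1ULL << 40) | TYPE_FAST};
    const uint64_t mixed[]     = {(0xABCDULL << 32) | TYPE_FAST, (0x1ULL << 40) | TYPE_SLOW};
    assert(all_fast_type(only_fast, 2));   // fast path may be taken
    assert(!all_fast_type(mixed, 2));      // at least one item needs the slow path
}
```

A caller would run the fast path when all_fast_type returns true and fall back to per-item dispatch otherwise.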

Compiler warnings are your future errors

Months ago I was asked to upgrade GCC from version 4.7 to 4.9 and also to clean up the configure scripts. Not a very exciting task, merely a time-consuming one. Oh, we were hit by a bug in libstdc++, but a simple patch fixed the problem. A few weeks later I was asked to change the GCC switch from -std=c++11 to -std=c++14, the easiest task in the world. I had to modify a single script, run configure, type make, then run the tests... everything was OK. Quite boring so far. Read more ...

AVX512: ternary functions evaluation

Intel's version of SIMD offers the following 2-argument (binary) boolean functions: and, or, xor, and and-not. There is no single-argument not; it can be expressed as xor reg, ones, but this requires an additional, pre-set register.

AVX512F will come with a very interesting instruction called vpternlog. Read more ...
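To show what vpternlog computes, here is a scalar sketch of a single 32-bit element: for every bit position the three input bits form a 3-bit index, and the result bit is read from the 8-bit immediate at that index. The function name ternary_logic and the constants A, B, C are made up for this example; the real instruction works on whole 512-bit registers.

```cpp
#include <cassert>
#include <cstdint>

// Scalar model of one 32-bit element of vpternlogd:
// imm8 is the truth table, indexed by the bits (a, b, c) with a as the most significant.
uint32_t ternary_logic(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8) {
    uint32_t result = 0;
    for (int bit = 0; bit < 32; ++bit) {
        const unsigned index = (((a >> bit) & 1u) << 2)
                             | (((b >> bit) & 1u) << 1)
                             |  ((c >> bit) & 1u);
        result |= static_cast<uint32_t>((imm8 >> index) & 1u) << bit;
    }
    return result;
}

// Evaluating a boolean expression on these three constants yields its truth table.
constexpr uint8_t A = 0xF0, B = 0xCC, C = 0xAA;

// Small self-check: xor of three inputs, bitwise select, and a one-operand not.
void demo_ternary_logic() {
    const uint32_t a = 0x12345678u, b = 0x0F0F0F0Fu, c = 0xDEADBEEFu;
    const uint8_t xor3   = static_cast<uint8_t>(A ^ B ^ C);           // 0x96
    const uint8_t select = static_cast<uint8_t>((A & B) | (~A & C));  // 0xCA
    const uint8_t not_c  = static_cast<uint8_t>(~C);                  // 0x55
    assert(ternary_logic(a, b, c, xor3) == (a ^ b ^ c));
    assert(ternary_logic(a, b, c, select) == ((a & b) | (~a & c)));
    assert(ternary_logic(a, b, c, not_c) == ~c);  // no pre-set register of ones needed
}
```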

Saturday, 21 March 2015

Not everything in AVX2 is 256-bit

AVX2 added support for 256-bit arguments to many, although not all, operations on packed integers. Some instructions accept 256-bit registers but operate on 128-bit lanes rather than on the whole register.

There are three major groups of such instructions: packing (narrowing conversion), unpacking (interleaving) and permutations. Below is the full list of instructions (with intrinsics):

vpalignr (_mm256_alignr_epi8)
vpslldq (_mm256_bslli_epi128)
vpsrldq (_mm256_bsrli_epi128)
vmpsadbw (_mm256_mpsadbw_epu8)
vpacksswb (_mm256_packs_epi16)
vpackssdw (_mm256_packs_epi32)
vpackuswb (_mm256_packus_epi16)
vpackusdw (_mm256_packus_epi32)
vperm2i128 (_mm256_permute2x128_si256)
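vpermq (_mm256_permute4x64_epi64), which in fact permutes 64-bit elements across the full register
vpermpd (_mm256_permute4x64_pd), likewise a full-register permutation of 64-bit elements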
vpshufd (_mm256_shuffle_epi32)
vpshufb (_mm256_shuffle_epi8)
vpshufhw (_mm256_shufflehi_epi16)
vpshuflw (_mm256_shufflelo_epi16)
vpslldq (_mm256_slli_si256)
vpsrldq (_mm256_srli_si256)
vpunpckhwd (_mm256_unpackhi_epi16)
vpunpckhdq (_mm256_unpackhi_epi32)
vpunpckhqdq (_mm256_unpackhi_epi64)
vpunpckhbw (_mm256_unpackhi_epi8)
vpunpcklwd (_mm256_unpacklo_epi16)
vpunpckldq (_mm256_unpacklo_epi32)
vpunpcklqdq (_mm256_unpacklo_epi64)
vpunpcklbw (_mm256_unpacklo_epi8)

For me the most surprising are the packing instructions (vpack*), as they require additional shuffling (before or after the instruction) if we want to keep the order of values. In some cases the order is crucial.
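The ordering problem is easy to see in a scalar model. The sketch below imitates the lane-wise behaviour of vpackssdw (_mm256_packs_epi32) on plain arrays, then restores the natural order with a model of vpermq (_mm256_permute4x64_epi64) using the immediate 0xD8, which picks the 64-bit quarters in the order 0, 2, 1, 3. The types and the model_* names are made up for this example; it is an illustration, not real SIMD code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Vec8i32  = std::array<int32_t, 8>;   // stands in for a 256-bit register of 32-bit values
using Vec16i16 = std::array<int16_t, 16>;  // stands in for a 256-bit register of 16-bit values

static int16_t saturate_to_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return static_cast<int16_t>(v);
}

// Model of vpackssdw: each 128-bit lane is packed on its own,
// so the result is a[0..3], b[0..3], a[4..7], b[4..7].
Vec16i16 model_packs_epi32(const Vec8i32& a, const Vec8i32& b) {
    Vec16i16 r{};
    for (std::size_t lane = 0; lane < 2; ++lane)
        for (std::size_t i = 0; i < 4; ++i) {
            r[lane * 8 + i]     = saturate_to_i16(a[lane * 4 + i]);
            r[lane * 8 + 4 + i] = saturate_to_i16(b[lane * 4 + i]);
        }
    return r;
}

// Model of vpermq: output quarter q is the input quarter selected by bits [2q+1:2q] of imm8.
Vec16i16 model_permute4x64(const Vec16i16& v, unsigned imm8) {
    Vec16i16 r{};
    for (std::size_t q = 0; q < 4; ++q) {
        const std::size_t src = (imm8 >> (2 * q)) & 3u;
        for (std::size_t i = 0; i < 4; ++i)
            r[q * 4 + i] = v[src * 4 + i];
    }
    return r;
}

// Small self-check: the pack interleaves lanes, the permutation puts them back in order.
void demo_pack_order() {
    const Vec8i32 a = {{0, 1, 2, 3, 4, 5, 6, 7}};
    const Vec8i32 b = {{8, 9, 10, 11, 12, 13, 14, 15}};
    const Vec16i16 packed = model_packs_epi32(a, b);
    assert(packed[4] == 8 && packed[8] == 4);  // b[0] lands before a[4]
    const Vec16i16 fixed = model_permute4x64(packed, 0xD8);
    for (std::size_t i = 0; i < 16; ++i)
        assert(fixed[i] == static_cast<int16_t>(i));
}
```

With real intrinsics the same fix costs one extra cross-lane permutation after the pack, which is the additional shuffling mentioned above.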