Monday, 13 April 2015
Speeding up bit-parallel population count
Nearly 50% faster than the naive version for large data sets. Discovered by accident. :)
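For context, here is a minimal sketch of the classic bit-parallel population count for 64-bit words. It is an assumption that this is roughly what "naive version" refers to; it is not the faster variant from the article, and the function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Classic bit-parallel popcount of a single 64-bit word.
   Hypothetical helper name; shown only as a baseline. */
static uint64_t popcount64(uint64_t x) {
    /* sum adjacent bits into 2-bit fields */
    x = x - ((x >> 1) & 0x5555555555555555ULL);
    /* sum adjacent 2-bit fields into 4-bit fields */
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL);
    /* sum adjacent 4-bit fields into bytes */
    x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
    /* add all bytes together; the total lands in the top byte */
    return (x * 0x0101010101010101ULL) >> 56;
}

/* Straightforward loop over a buffer: one full popcount per word. */
static uint64_t popcount_buffer(const uint64_t *data, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += popcount64(data[i]);
    }
    return total;
}
```

Calling `popcount_buffer` on an array of words returns the total number of set bits in the array.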
Thursday, 9 April 2015
GitHub repositories
I've put the source code for two of my articles on GitHub:
- SSSE3: fast popcount --- repository
- SSE4 string search - modification of Karp-Rabin algorithm --- repository
BTW the article about popcount has gained popularity, and I hope another crazy idea about hacking MPSADBW will spread all over the world.
SIMD-ized searching in unique constant dictionary
The problem: there is an ordered dictionary containing only
unique keys. The dictionary is read-only, and keys are 32-bit (SSE) or
64-bit (AVX2). Read more
Sunday, 22 March 2015
SIMD: detecting a bit pattern
The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k >= 0). The type field is located in the lower 32 bits.
The procedure processes two "types", one denoted by code 3 and the other by code 5. When all items are of type 3 we can use a fast AVX2 path; if any item is of type 5, we have to call an additional function (a virtual method, to be precise). Read more ...
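A scalar sketch of the check that decides between the two paths is given below. The field position and width are not stated here, so the 3-bit field at bits 0..2 is purely an assumption, as are all the names. The loop uses the same xor/and/or accumulation that a SIMD version would typically use, but it is not the article's AVX2 code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout (hypothetical): a 3-bit type field at bits 0..2. */
#define TYPE_SHIFT 0
#define TYPE_MASK  (0x7ULL << TYPE_SHIFT)
#define TYPE_FAST  (3ULL << TYPE_SHIFT)   /* type code 3 */

/* Returns true when every item has type 3, i.e. the fast path is safe.
   Branch-free body: xor against the expected code, mask the field,
   and OR the differences together; any non-zero bit means a mismatch. */
static bool all_items_are_type3(const uint64_t *items, size_t n) {
    uint64_t diff = 0;
    for (size_t i = 0; i < n; i++) {
        diff |= (items[i] ^ TYPE_FAST) & TYPE_MASK;
    }
    return diff == 0;
}
```

When the function returns false, the caller would fall back to the slower per-item handling.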
Compiler warnings are your future errors
Months ago I was asked to upgrade GCC from version 4.7 to 4.9 and also clean up the configure scripts. Not a very exciting task, merely time-consuming. Oh, we were hit by a bug in libstdc++, but a simple patch fixed the problem. A few weeks later I was asked to change the GCC switch from -std=c++11 to -std=c++14, the easiest task in the world. I had to modify a single script, run configure, type make, then run the tests... everything was OK. Quite boring so far. Read more ...
AVX512: ternary functions evaluation
Intel's SIMD instruction sets offer the following two-argument (binary) boolean functions: and, or, xor, and and-not. There is no single-argument not; it can be expressed as xor reg, ones, but this requires an additional, pre-set register.
AVX512F will come with a very interesting instruction called vpternlog. Read more ...
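To make the idea concrete, here is a scalar model of how a ternary-logic instruction of this kind evaluates an arbitrary three-input boolean function from an 8-bit truth table. It is a sketch on plain 64-bit integers, not real AVX512 code, and the function name is hypothetical. For each bit position, the three input bits form an index (first operand as the most significant bit) that selects one bit of the immediate.

```c
#include <stdint.h>

/* Scalar model of a ternary logic operation on 64-bit words.
   imm is the truth table: bit number (a<<2 | b<<1 | c) of imm
   is the result for input bits a, b, c. */
static uint64_t ternary_logic(uint64_t a, uint64_t b, uint64_t c, uint8_t imm) {
    uint64_t result = 0;
    for (int bit = 0; bit < 64; bit++) {
        unsigned idx = (unsigned)((((a >> bit) & 1) << 2) |
                                  (((b >> bit) & 1) << 1) |
                                   ((c >> bit) & 1));
        result |= (uint64_t)((imm >> idx) & 1) << bit;
    }
    return result;
}
```

With this encoding, 0x96 yields a xor b xor c, 0xCA yields the bitwise select (a and b) or (not a and c), and 0x0F yields not a, so the single-argument not no longer needs a pre-set register of ones.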
Saturday, 21 March 2015
Not everything in AVX2 is 256-bit
AVX2 added support for 256-bit arguments to many operations on packed integers, although not all. Some instructions accept 256-bit registers but operate on 128-bit lanes rather than on the whole register.
There are three major groups of instructions: packing (narrowing conversion), unpacking (interleave) and permutations; below is a full list of instructions (with intrinsics):
- vpalignr (_mm256_alignr_epi8)
- vpslldq (_mm256_bslli_epi128)
- vpsrldq (_mm256_bsrli_epi128)
- vmpsadbw (_mm256_mpsadbw_epu8)
- vpacksswb (_mm256_packs_epi16)
- vpackssdw (_mm256_packs_epi32)
- vpackuswb (_mm256_packus_epi16)
- vpackusdw (_mm256_packus_epi32)
- vperm2i128 (_mm256_permute2x128_si256)
- vpermq (_mm256_permute4x64_epi64)
- vpermpd (_mm256_permute4x64_pd)
- vpshufd (_mm256_shuffle_epi32)
- vpshufb (_mm256_shuffle_epi8)
- vpshufhw (_mm256_shufflehi_epi16)
- vpshuflw (_mm256_shufflelo_epi16)
- vpslldq (_mm256_slli_si256)
- vpsrldq (_mm256_srli_si256)
- vpunpckhwd (_mm256_unpackhi_epi16)
- vpunpckhdq (_mm256_unpackhi_epi32)
- vpunpckhqdq (_mm256_unpackhi_epi64)
- vpunpckhbw (_mm256_unpackhi_epi8)
- vpunpcklwd (_mm256_unpacklo_epi16)
- vpunpckldq (_mm256_unpacklo_epi32)
- vpunpcklqdq (_mm256_unpacklo_epi64)
- vpunpcklbw (_mm256_unpacklo_epi8)
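To illustrate what "operates on 128-bit lanes" means in practice, here is a scalar model of the 256-bit byte shift vpslldq (_mm256_slli_si256) from the list above. It is a sketch using plain byte arrays rather than intrinsics, with a hypothetical function name; byte 0 is the least significant byte of the register. Each 16-byte lane is shifted independently, so bytes never cross from the lower lane into the upper one.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of a lane-wise 256-bit byte shift left.
   in and out are 32-byte buffers and must not overlap. */
static void shift_left_bytes_per_lane(const uint8_t in[32], uint8_t out[32],
                                      unsigned count) {
    if (count > 16) {
        count = 16;            /* shifting by 16 or more clears the lane */
    }
    memset(out, 0, 32);        /* vacated bytes are filled with zeros */
    for (int lane = 0; lane < 2; lane++) {
        for (unsigned i = count; i < 16; i++) {
            out[lane * 16 + i] = in[lane * 16 + i - count];
        }
    }
}
```

With input bytes 1..32 and a shift of 4, bytes 16..19 of the result are zero, whereas a true 256-bit shift would have moved bytes 13..16 there.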