środa, 31 marca 2010

Transpose bits in byte using SIMD instructions

Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.

Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
  1. fill vector with given byte
    [11010001] =>
    [11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
  2. leave one bit per byte
    [10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001]
  3. perform desired transposition ("move" bits around)
    [00000001|000000010|00000000|00001000|00000000|00000000|00000000|10000000]
  4. perform horizontal OR of all bytes
    [10001011]
Here is my old MMX code (polish text); below SSE/SSE5 implementation details.

Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
# 1.1
movd %eax, %xmm0
shufps $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0

# 1.2
pxor %xmm1, %xmm1
movd %eax, %xmm0
pshufb %xmm1, %xmm0
Ad 2. Simple pand with mask packed_qword(0x8040201008040201).
pand  MASK1, %xmm0
Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq - we get negated result), then mask bits on desired positons.

SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount - so in this case just one instruction is needed.
# SSE
pcmpeq %xmm1, %xmm0
pandn MASK2, %xmm0

# SSE5
protb ROT, %xmm0, %xmm0
Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.
psadbw  %xmm1, %xmm0
movd %xmm0, %eax
Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.

Brak komentarzy: