Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
- fill vector with given byte
[11010001] =>
[11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001] - leave one bit per byte
[10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001] - perform desired transposition ("move" bits around)
[00000001|000000010|00000000|00001000|00000000|00000000|00000000|10000000] - perform horizontal OR of all bytes
[10001011]
Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
# 1.1Ad 2. Simple pand with mask packed_qword(0x8040201008040201).
movd %eax, %xmm0
shufps $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0
# 1.2
pxor %xmm1, %xmm1
movd %eax, %xmm0
pshufb %xmm1, %xmm0
pand MASK1, %xmm0Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq - we get negated result), then mask bits on desired positons.
SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount - so in this case just one instruction is needed.
# SSEAd 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.
pcmpeq %xmm1, %xmm0
pandn MASK2, %xmm0
# SSE5
protb ROT, %xmm0, %xmm0
psadbw %xmm1, %xmm0Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.
movd %xmm0, %eax