Sunday, 9 May 2010

Branchless set mask if value greater, or how to print hex values

Suppose we need a mask that is set when a nonnegative argument is greater than some constant value; in other words, we want to evaluate the following expression:

if x > const_n then
   mask := 0xffffffff
else
   mask := 0x00000000;

Portable branchless solution:
  • choose magic number M := (1 << k) - 1 - n, where k is a bit position, for example 31 if we operate on 32-bit words
  • calculate R := x + M
  • the k-th bit of R is set if and only if x > n (assuming x and n are both smaller than 1 << k)
  • fill the mask with this bit (see the note Fill word with selected bit)

The key to understanding this trick is the binary form of M: 0111..1111zzzz, where each z is 0 or 1 depending on the value of n. When x is greater than n, x + M has the form 1000..0000zzzz, because the carry propagates through the run of ones up to the k-th bit of the result.
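
The steps above can be sketched in C. This is only an illustration under stated assumptions: the function name mask_if_greater is mine, k is fixed at 31, and both x and n are assumed to be below 2^31 so that x + M cannot wrap around. Instead of an arithmetic shift, the sketch fills the word by negating the extracted bit, since right-shifting a negative signed value is implementation-defined in C.

```c
#include <stdint.h>

/* Hypothetical helper: returns 0xffffffff when x > n, otherwise 0.
   Assumes x < 2^31 and n < 2^31, so x + M fits in 32 bits. */
uint32_t mask_if_greater(uint32_t x, uint32_t n)
{
    const unsigned k = 31;                           /* bit that receives the carry */
    const uint32_t M = (UINT32_C(1) << k) - 1 - n;   /* 0x7fffffff - n */
    uint32_t R = x + M;                              /* bit k is set iff x > n */
    return (uint32_t)0 - (R >> k);                   /* fill the word with bit k */
}
```

For n = 9 the constant M is 0x7ffffff6: x = 9 gives R = 0x7fffffff (bit 31 clear), while x = 10 gives R = 0x80000000 (bit 31 set).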

Real-world example: branchless conversion of a hex digit to ASCII (M = 0x7ffffff6 for k = 31 and n = 9).

; input:    eax - hex digit
; output:   eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx

        andl $0xf, %eax                 ; keep only the low nibble
        leal 0x7ffffff6(%eax), %ebx     ; MSB(ebx)=1 when eax >= 10
        sarl $31, %ebx                  ; ebx - mask
        andl  $7, %ebx                  ; ebx = 7 when eax >= 10 (for A-F letters)
        ;andl $39, %ebx                 ; ebx = 39 when eax >= 10 (for a-f letters)
        leal '0'(%eax, %ebx), %eax      ; eax = '0' + eax + ebx => ASCII letter
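
For readers who prefer C, the same sequence might look roughly like the sketch below. The function name hex_digit_to_ascii and the letter_offset parameter are my own inventions for illustration (7 selects A-F, 39 selects a-f); negating the top bit stands in for the arithmetic shift used in the assembly.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the assembly above.
   letter_offset: 7 for 'A'-'F', 39 for 'a'-'f'. */
char hex_digit_to_ascii(uint32_t digit, uint32_t letter_offset)
{
    uint32_t x = digit & 0xf;                   /* andl $0xf, %eax */
    uint32_t r = x + UINT32_C(0x7ffffff6);      /* bit 31 set when x >= 10 */
    uint32_t mask = (uint32_t)0 - (r >> 31);    /* all ones when x >= 10, else 0 */
    return (char)('0' + x + (mask & letter_offset));
}
```

Calling hex_digit_to_ascii(10, 7) yields 'A', and hex_digit_to_ascii(10, 39) yields 'a'.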

It is also possible to convert 4 hex digits in parallel using a similar algorithm, but the input data has to be prepared correctly. Moreover, generating the mask requires 3 instructions and one extra register (the scalar version needs just one arithmetic shift). I guess it won't be fast on x86; this approach may be better suited to SIMD code, where similar code transforms more bytes at once.

; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx

        leal 0x76767676(%eax), %ebx        ; MSB of each byte is set when corresponding eax byte is >= 10
                                           ; (here: 0x7f - 9 = 0x76)
        andl $0x80808080, %ebx
        movl %ebx, %ecx
        shrl    $7, %ebx
        subl %ebx, %ecx                    ; ecx - byte-wise mask
        ;andl $0x07070707, %ecx            ; for ASCII letters A-F
        andl $0x27272727, %ecx             ; for ASCII letters a-f
        leal 0x30303030(%eax, %ecx), %eax  ; eax - four ASCII letters
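
A rough C sketch of the parallel version, including one possible way to prepare the input. Both function names (spread_nibbles, hex4_to_ascii) and the letter_offsets parameter are assumptions made for this illustration. Note that the byte-wise mask is 0x7f, not 0xff, in the flagged bytes; that is enough because both 0x07 and 0x27 fit within 0x7f.

```c
#include <stdint.h>

/* Hypothetical helper: spreads a 16-bit value 0xabcd into
   one nibble per byte, i.e. 0x0a0b0c0d. */
uint32_t spread_nibbles(uint16_t v)
{
    uint32_t x = v;
    x = (x | (x << 8)) & UINT32_C(0x00ff00ff);   /* 0x00ab00cd */
    x = (x | (x << 4)) & UINT32_C(0x0f0f0f0f);   /* 0x0a0b0c0d */
    return x;
}

/* Hypothetical helper mirroring the assembly above.
   digits: four hex digits in the form 0x0a0b0c0d.
   letter_offsets: 0x07070707 for A-F, 0x27272727 for a-f. */
uint32_t hex4_to_ascii(uint32_t digits, uint32_t letter_offsets)
{
    /* MSB of each byte is set when that digit is >= 10 */
    uint32_t t = (digits + UINT32_C(0x76767676)) & UINT32_C(0x80808080);
    uint32_t mask = t - (t >> 7);                /* 0x7f in each flagged byte */
    return digits + UINT32_C(0x30303030) + (mask & letter_offsets);
}
```

The first digit ends up in the most significant byte of the result, so on a little-endian machine such as x86 the word has to be byte-swapped (or the input prepared in reverse order) before it is stored to memory as text.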

See also: SSSE3: printing hex values (weird use of PSHUFB instruction)

Saturday, 1 May 2010

Speeding up reversing a table of bytes

With the help of the BSWAP instruction, SSE2 instructions (PSHUFD, PSHUFLW, PSHUFHW) or the SSSE3 instruction PSHUFB, reversing a table can be made faster. The speedup depends on three factors:
  • table size: larger = faster
  • table address: aligned = faster or much faster (a speedup of 15.5 is possible; see chart)
  • CPU type
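
To give a flavour of the BSWAP variant, here is a minimal C sketch. It is not the code from the full article: the names bswap32 and reverse_bytes are hypothetical, the byte swap is written with portable shifts (which many compilers recognise and may compile to BSWAP on x86), and memcpy is used so that unaligned addresses stay well-defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable 32-bit byte swap, written with shifts and masks. */
uint32_t bswap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u)
         | ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Reverses buf[0..n) in place, swapping 4 bytes from each end per step. */
void reverse_bytes(uint8_t *buf, size_t n)
{
    size_t i = 0, j = n;
    while (j - i >= 8) {
        uint32_t a, b;
        memcpy(&a, buf + i, 4);          /* head word */
        memcpy(&b, buf + j - 4, 4);      /* tail word */
        a = bswap32(a);
        b = bswap32(b);
        memcpy(buf + i, &b, 4);          /* reversed tail goes to the head */
        memcpy(buf + j - 4, &a, 4);      /* reversed head goes to the tail */
        i += 4;
        j -= 4;
    }
    while (j - i >= 2) {                 /* leftover middle bytes, one pair at a time */
        uint8_t t = buf[i];
        buf[i] = buf[j - 1];
        buf[j - 1] = t;
        i++;
        j--;
    }
}
```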

Read full article