niedziela, 11 kwietnia 2010

Determining if an integer is a power of 2

Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:
; input/ouput: eax
; destroys: ebx

        test    %eax,  %eax     ; x == 0?
        jz      1f

        leal -1(%eax), %ebx     ; ebx := x-1
        test    %eax,  %ebx     ; ZF  := (eax & ebx == 0)

        setz     %al
        movzx    %al, %eax       ; eax := ZF
1:

We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.

; input/output: eax
; destroys: ebx, edx

        bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
        jz      1f
        bsr     %eax, %edx      ; edx := MSB's position

        cmp     %ebx, %edx      ; ZF  := (ebx = edx)

        setz    %al
        movzx   %al, %eax       ; eax := ZF
1:

czwartek, 8 kwietnia 2010

Brenchless conditional exchange

Suppose we have to exchange (or just move) two registers A and B:
  1. C := A xor B
  2. C := 0 if condition is not true
  3. A := A xor C
  4. B := B xor C
If C is 0, then  A and B left unchanged, else A and B are swapped. If only conditional move from B to A is needed, then step 4th have to be skipped.

Here is a sample x86 code, where condition is value of CF:
sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4

Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.

See also XOR linked list.

sobota, 3 kwietnia 2010

STL: map with string as key - access speedup

The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time - each map store keys sharing certain properties. Drawback: additional memory.

I've tested following grouping schemes:
  1. length of string,
  2. first letter of string (one level trie),
  3. both length and first letter.
Third is the fastest - around 60% faster then plain std::map from gcc (red-black tree).

Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).

+-------------+--------------------------------+--------------------------------+
|             |          running time [ms]     |            speedup [%]         |
| data struct +----------+----------+----------+----------+----------+----------+
|             |    min   |   avg    |   max    |   min    |   avg    |   max    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                          inserting                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    269   |   287    |   355    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    218   |   241    |   395    |    81    |    84    |   111    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    218   |   240    |   345    |    81    |    84    |    97    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    165   |   172    |   207    |    61    |    60    |    58    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                            finding                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    295   |   322    |   483    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    243   |   263    |   460    |    82    |    82    |    95    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    238   |   248    |   292    |    80    |    77    |    60    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    184   |   190    |   241    |    62    |    60    |    50    |
+-------------+----------+----------+----------+----------+----------+----------+

Download test program.

czwartek, 1 kwietnia 2010

Branchless signum

Problem: calculate value of sign(x):
  • -1 when x < 0
  • 0 when x = 0,
  • +1 when x > 0.
My solution do not involve any hardware specific things like ALU flags nor special instructions - just plain AND, OR, shifts.
 ; input: eax = X

 movl %eax, %ebx
 sarl $31, %eax  // eax = -1 if X less then zero, 0 otherwise

 andl $0x7fffffff, %ebx
 addl $0x7fffffff, %ebx // MSB is set if any lower bits were set
 shrl $31, $ebx  // eax = +1 if X greater then zero, 0 otherwise

 orl %ebx, %eax  // eax = result

C99 implementation:
int32_t sign(int32_t x) {
 int32_t y;
 y = (x & 0x7fffffff) + 0x7fffffff;
 return (x >> 31) | ((uint32_t)y >> 31);
}

Fill word with selected bit

This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.

1. The most general algorithm:
  1. mask bit
    [10111010] => [00010000]
  2. clone word
    a=[00010000], b=[00010000]
  3. shift bit in first word to MSB, and to LSB in second word
    a=[10000000], b=[00000001]
  4. subtract c = a - b
    c=[01111111]
  5. add missing MSB c = c OR a
    c=[11111111]
2. If arithmetic shifts are supported:
  1. shift bit to MSB
    a=[10000000]
  2. arithmetic shift right
    a=[11111111]
3. On processor 386+ we can copy selected bit from reg to carry flag (CF) with bt reg, reg and then clone CF with sbb reg, reg.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

uint32_t fill1(uint32_t a, int bit) {
uint32_t b;

b = a = a & (1 << bit);

a <<= 31 - bit;
b >>= bit;

return (a - b) | a;
}

uint32_t fill2(uint32_t a, int bit) {
a <<= 31 - bit;
return (int32_t)(a) >> 31;
}

uint32_t fill386(uint32_t a, int bit) {
uint32_t result;             
__asm__ __volatile__ (
"bt  %1, %0\n"
"sbb %0, %0\n"
: "=r" (result)
: "r" (bit), "0" (a)
);
return result;
}

int main(int argc, char* argv[]) {
uint32_t x, i;
if (argc > 1)
x = (unsigned long)strtol(argv[1], NULL, 0);

printf("input = %08x\n", x);
for (i=0; i < 32; i++)
printf("bit %2d: fill1 = %08x, fill2 = %08x, fill386 = %08x\n",
i,
fill1(x, i),
fill2(x, i),
fill386(x, i)
);

return EXIT_SUCCESS;
}