## niedziela, 11 kwietnia 2010

### Determining if an integer is a power of 2

Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:
```; input/ouput: eax
; destroys: ebx

test    %eax,  %eax     ; x == 0?
jz      1f

leal -1(%eax), %ebx     ; ebx := x-1
test    %eax,  %ebx     ; ZF  := (eax & ebx == 0)

setz     %al
movzx    %al, %eax       ; eax := ZF
1:
```

We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.

```; input/output: eax
; destroys: ebx, edx

bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
jz      1f
bsr     %eax, %edx      ; edx := MSB's position

cmp     %ebx, %edx      ; ZF  := (ebx = edx)

setz    %al
movzx   %al, %eax       ; eax := ZF
1:
```

## czwartek, 8 kwietnia 2010

### Brenchless conditional exchange

Suppose we have to exchange (or just move) two registers A and B:
1. C := A xor B
2. C := 0 if condition is not true
3. A := A xor C
4. B := B xor C
If C is 0, then  A and B left unchanged, else A and B are swapped. If only conditional move from B to A is needed, then step 4th have to be skipped.

Here is a sample x86 code, where condition is value of CF:
```sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4
```

Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.

## sobota, 3 kwietnia 2010

### STL: map with string as key - access speedup

The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time - each map store keys sharing certain properties. Drawback: additional memory.

I've tested following grouping schemes:
1. length of string,
2. first letter of string (one level trie),
3. both length and first letter.
Third is the fastest - around 60% faster then plain std::map from gcc (red-black tree).

Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).

```+-------------+--------------------------------+--------------------------------+
|             |          running time [ms]     |            speedup [%]         |
| data struct +----------+----------+----------+----------+----------+----------+
|             |    min   |   avg    |   max    |   min    |   avg    |   max    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                          inserting                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    269   |   287    |   355    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    218   |   241    |   395    |    81    |    84    |   111    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    218   |   240    |   345    |    81    |    84    |    97    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    165   |   172    |   207    |    61    |    60    |    58    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                            finding                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    295   |   322    |   483    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    243   |   263    |   460    |    82    |    82    |    95    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    238   |   248    |   292    |    80    |    77    |    60    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    184   |   190    |   241    |    62    |    60    |    50    |
+-------------+----------+----------+----------+----------+----------+----------+```

## czwartek, 1 kwietnia 2010

### Branchless signum

Problem: calculate value of sign(x):
• -1 when x < 0
• 0 when x = 0,
• +1 when x > 0.
My solution do not involve any hardware specific things like ALU flags nor special instructions - just plain AND, OR, shifts.
``` ; input: eax = X

movl %eax, %ebx
sarl \$31, %eax  // eax = -1 if X less then zero, 0 otherwise

andl \$0x7fffffff, %ebx
addl \$0x7fffffff, %ebx // MSB is set if any lower bits were set
shrl \$31, \$ebx  // eax = +1 if X greater then zero, 0 otherwise

orl %ebx, %eax  // eax = result
```

C99 implementation:
```int32_t sign(int32_t x) {
int32_t y;
y = (x & 0x7fffffff) + 0x7fffffff;
return (x >> 31) | ((uint32_t)y >> 31);
}
```

### Fill word with selected bit

This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.

1. The most general algorithm:
 => 
2. clone word
a=, b=
3. shift bit in first word to MSB, and to LSB in second word
a=, b=
4. subtract c = a - b
c=
5. add missing MSB c = c OR a
c=
2. If arithmetic shifts are supported:
1. shift bit to MSB
a=
2. arithmetic shift right
a=
3. On processor 386+ we can copy selected bit from reg to carry flag (CF) with bt reg, reg and then clone CF with sbb reg, reg.
```#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

uint32_t fill1(uint32_t a, int bit) {
uint32_t b;

b = a = a & (1 << bit);

a <<= 31 - bit;
b >>= bit;

return (a - b) | a;
}

uint32_t fill2(uint32_t a, int bit) {
a <<= 31 - bit;
return (int32_t)(a) >> 31;
}

uint32_t fill386(uint32_t a, int bit) {
uint32_t result;
__asm__ __volatile__ (
"bt  %1, %0\n"
"sbb %0, %0\n"
: "=r" (result)
: "r" (bit), "0" (a)
);
return result;
}

int main(int argc, char* argv[]) {
uint32_t x, i;
if (argc > 1)
x = (unsigned long)strtol(argv, NULL, 0);

printf("input = %08x\n", x);
for (i=0; i < 32; i++)
printf("bit %2d: fill1 = %08x, fill2 = %08x, fill386 = %08x\n",
i,
fill1(x, i),
fill2(x, i),
fill386(x, i)
);

return EXIT_SUCCESS;
}
```