sobota, 22 marca 2014

C++ bitset vs array

C++ bitset conserves a memory, but at cost of speed access. Bitset must be slower than set represented as a plain old array, at least when sets are small (say few hundred elements).

Lets look at this simple functions:

// set_test.cpp 
#include <stdint.h>
#include <bitset>

const int size = 128;

typedef uint8_t byte_set[size];

bool any_in_byteset(uint8_t* data, size_t size, byte_set set) {
 for (auto i=0u; i < size; i++)
  if (set[data[i]])
   return true;
 return false;

typedef std::bitset<size> bit_set;

bool any_in_bitset(uint8_t* data, size_t size, bit_set set) {
 for (auto i=0u; i < size; i++)
  if (set[data[i]])
   return true;
 return false;

The file was compiled with g++ -std=c++11 -O3 set_test.cpp; Assembly code of the core of any_in_byteset:

  28: 0f b6 10              movzbl (%eax),%edx
  2b: 83 c0 01              add    $0x1,%eax
  2e: 80 3c 11 00           cmpb   $0x0,(%ecx,%edx,1)
  32: 75 0c                 jne    40
  34: 39 d8                 cmp    %ebx,%eax
  36: 75 f0                 jne    28
Statement if (set[data[i]]) return true are lines 28, 2e and 32, i.e.: load from memory, compare and jump. Instructions 2b, 34 and 36 handles the for loop.

Now, look at assembly code of any_in_bitset:

  5f: 0f b6 13              movzbl (%ebx),%edx
  62: b8 01 00 00 00        mov    $0x1,%eax
  67: 89 d1                 mov    %edx,%ecx
  69: 83 e1 1f              and    $0x1f,%ecx
  6c: c1 ea 05              shr    $0x5,%edx
  6f: d3 e0                 shl    %cl,%eax
  71: 85 44 94 18           test   %eax,0x18(%esp,%edx,4)
  75: 75 39                 jne    b0

All these instructions implements the if statement! Again we have load from memory (5f), but checking which bit is set require much more work. Input (edx) is split to lower part --- i.e. bit number (67, 6c) and higher part --- i.e. word index (6c). Last step is to check if a bit is set in a word --- GCC used variable shift left (6f), but x86 has BT instruction, so in a perfect code we would have 2 instructions less.

However, as we see simple access in the bitset is much more complicated than simple memory fetch from byteset. For short sets memory fetches are well cached and smaller number of instruction improves performance. For really large set cache misses would kill performance, then bitset is much better choice.

środa, 19 marca 2014

Is const-correctness paranoia good?

Yes, definitely. Lets see this simple example:
$ cat test.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;
$ g++ -c test.cpp
$ g++ -c -Wall test.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;
Only when we turn on warnings, compiler tell us about a possible error. Making the parameter const shows us error:
$ cat test2.cpp
int test(int x) {
 if (x = 1)
  return 42;
  return 0;
$ g++ -c test.cpp
test2.cpp: In function ‘int test(int)’:
test2.cpp:2:8: error: assignment of read-only parameter ‘x’
  if (x = 1)
All input parameters should be const, all write-once variables serving as a parameters for some computations should be also const.

Quick and dirty ad-hoc git hosting

Recently I needed to synchronize my local repository with a remote machine, just for full backup. It's really simple if you have standard Linux tools (Cygwin works too, of course).

1. in a working directory run:
$ pwd
$ git update-server-info

2. in a parent directory start HTTP server:
$ cd ..
$ pwd
$ python -m SimpleHTTPServer
Serving HTTP on port 8000 ...

3. on a remote machine clone/pull/whatever:
$ git clone http://your_ip:8000/project/.git
Step 1 have to be executed manually when local repository has changed.

niedziela, 16 marca 2014

Scalar version of SSE move mask instruction

SSE instruction PMOVMSKB gathers all most significant bits from bytes and stores them as a single 16-bit value; similar action is performed by MOVMSKPD and MOVMSKPS.

Such operation could be easily done using scalar multiplication. Read more ...

Asymmetric numeral systems

wtorek, 11 marca 2014

C++ standard inaccuracy

First we read:
21.4.1 basic_string general requirements [string.require]
3 No erase() or pop_back() member function shall throw any exceptions.
... few pages later: basic_string::erase [string::erase]
Allocator>& erase(size_type pos = 0, size_type n = npos);
2 Throws: out_of_range if pos > size().

SIMD-friendly Rabin-Karp modification

Rabin-Karp algorithm uses a weak hash function to locate possible substring positions. This modification uses merely equality of first and last char of searched substring, however equality of chars can be done very fast in parallel, even without SIMD instruction. Read more ...

niedziela, 9 marca 2014

GCC - asm goto

Starting from GCC 4.5 the asm statement has new form: asm goto. Programmer can use any labels from C/C++ program, however output from this block is not allowed.

Using asm block in old form required more instructions and additional storage:
 uint32_t bit_set;
 asm (
  "bt %2, %%eax  \n"
  "setc %%al  \n"
  "movzx %%al, %%eax \n"
  : "=a" (bit_set)
  : "a" (number), "r" (bit)
  : "cc"

 if (bit_set)
  goto has_bit;
Above code is compiled to:
 80483f6:       0f a3 d8                bt     %ebx,%eax
 80483f9:       0f 92 c0                setb   %al
 80483fc:       0f b6 c0                movzbl %al,%eax
 80483ff:       85 c0                   test   %eax,%eax
 8048401:       74 16                   je     8048419 
With asm goto the same task could be writting shorter and easier:
 asm goto (
  "bt %1, %0 \n"
  "jc %l[has_bit] \n"

  : /* no output */
  : "r" (number), "r" (bit)
  : "cc"
  : has_bit // name of label
Complete demo is available. See also GCC documentation about Extended Asm.

Integer log 10 of unsigned integer - SIMD version

Fast calculate ceil(log10(x)) of some unsigned number is described on Bit Twiddling Hacks, this text show the SIMD solution for 32-bit numbers.


1. populate value in XMM registers. Since maximum value of this function is 10 we need three registers.
movd   %eax, %xmm0          // xmm0 = packed_dword(0, 0, 0, x)
pshufd $0, %xmm0, %xmm0 \n" // xmm0 = packed_dword(x, x, x, x)
movapd %xmm0, %xmm1
movapd %xmm0, %xmm2
2. compare these numbers with sequence of powers of 10.
// powers_a = packed_dword(10^1 - 1, 10^2 - 1, 10^3 - 1, 10^4 - 1)
// powers_c = packed_dword(10^5 - 1, 10^6 - 1, 10^7 - 1, 10^8 - 1)
// powers_c = packed_dword(10^9 - 1, 0, 0, 0)
pcmpgtd powers_a, %xmm0
pcmpgtd powers_b, %xmm1
pcmpgtd powers_c, %xmm2
result of comparisons are: 0 (false) or -1 (true), for example:
xmm0 = packed_dword(-1, -1, -1, -1)
xmm1 = packed_dword( 0, 0, -1, -1)
xmm2 = packed_dword( 0, 0, 0, 0)
3. calculate sum of all dwords
psrld $31, %xmm0       // xmm0 = packed_dword( 1, 1, 1, 1) - convert -1 to 1
psubd %xmm1, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)
psubd %xmm2, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)

// convert packed_dword to packed_word
pxor %xmm1, %xmm1
packssdw %xmm1, %xmm0 // xmm0 = packed_word(0, 0, 0, 0, 1, 1, 2, 2)

// max value of word in xmm0 is 3, so higher
// bytes are always zero
psadbw %xmm1, %xmm0   // xmm0 = packded_qword(0, 6)
4. save result, i.e. the lowest dword
movd %xmm0, %eax      // eax = 6

Sample program is available.

Mask for zero/non-zero bytes

The description of Determine if a word has a zero byte from "Bit Twiddling Hacks" says about haszero(v): "the result is the high bits set where the bytes in v were zero".

Unfortunately this is not true. High bits are also set for ones followed zeros, i.e. haszero(0xff010100) = 0x00808080. Of course the result is still valid (non-zero if there were any zero byte), but if we want to iterate over all zeros or find last zero index, this could be a problem.

It's possible to create exact mask:
uint32_t nonzeromask(const uint32_t v) {
    // MSB are set if any of 7 lowest bits are set
    const uint32_t nonzero_7bit = (v & 0x7f7f7f7f) + 0x7f7f7f7f;

    return (v | nonzero_7bit) & 0x80808080;

uint32_t zeromask(const uint32_t v) {
     // negate MSBs
     return (nonzeromask(v)) ^ 0x80808080;

Function nonzeromask requires 4 simple instructions, and zeromask one additional xor.

poniedziałek, 3 marca 2014

Slow-paths in GNU libc strstr

I've observed that some patterns issued to strstr cause significant slowdown.

Sample program kill-strstr.c executes strstr(data, pattern), where data is a large string (16MB) filled with character ?; patterns are read from command line.

On my machine following times were recorded:

1. searching string 'johndoe'...
        time: 0.032
2. searching string '??????????????????a'...
        time: 0.050
3. searching string '??????????????????????????????a'...
        time: 0.049
4. searching string '???????????????????????????????a'...
        time: 0.274
5. searching string '??????????????????????????????a?'...
        time: 0.356
6. searching string '??????????????????????????????a??????????????????????????????'...
        time: 0.396
  • Slowdown is visible in case 4 (5 times slower than pattern 3). Pattern has 32 characters, and contains '?', except last char.
  • Even bigger slowdown occurs in case 5 (7 times slower). This pattern also contains 32 chars, but position of the single letter 'a' is last but one.
  • Similar slowdown occurs in case 5 (nearly 8 times slower). In this pattern single letter 'a' is surrounded by 30 '?'.