This document presents the full implementation details of BT-trees, a highly efficient ordered map, and an evaluation which compares BT-trees with unordered maps. BT- trees are often much faster than other ordered maps, and have comparable performance to unordered map implementations. However, in benchmarks which favor unordered maps, BT-trees are not faster than the fastest unordered map implementations we know of.
Pokazywanie postów oznaczonych etykietą data structures. Pokaż wszystkie posty
Pokazywanie postów oznaczonych etykietą data structures. Pokaż wszystkie posty
sobota, 20 czerwca 2015
Implementation of BT-trees
Great paper by Lars F. Bonnichsen, Christian W. Probst, Sven Karlsson:
środa, 20 maja 2015
Optimizing Dijkstra for real-world performance
Another interesting paper:
Our experimental results currently put our prototype implementation at about twice as fast as the Boost implementation of the algorithm on both real-world and generated large graphs. Furthermore, this preliminary implementation was written in only a few weeks, by a single programmer. The fact that such an early prototype compares favorably against Boost, a well-known open source library developed by expert programmers, gives us reason to believe our design for the queue is indeed better suited to the problem at hand, and the favorable time measurements are not a product of any specific implementation technique we employed.
niedziela, 16 listopada 2014
Speeding up searching in linked list
Sounds crazy, but it's possible in some cases. Here are experiments results - 3 times faster isn't so bad.
list : 0.780s, speedup 1.00 array list (4) : 0.703s, speedup 1.11 array list (8) : 0.515s, speedup 1.51 SIMD array list (4) : 0.365s, speedup 2.14 SIMD array list (8) : 0.258s, speedup 3.03
sobota, 22 marca 2014
C++ bitset vs array
C++ bitset conserves a memory, but at cost of speed access. Bitset must be slower than set represented as a plain old array, at least when sets are small (say few hundred elements).
Lets look at this simple functions:
The file was compiled with g++ -std=c++11 -O3 set_test.cpp; Assembly code of the core of any_in_byteset:
Now, look at assembly code of any_in_bitset:
All these instructions implements the if statement! Again we have load from memory (5f), but checking which bit is set require much more work. Input (edx) is split to lower part --- i.e. bit number (67, 6c) and higher part --- i.e. word index (6c). Last step is to check if a bit is set in a word --- GCC used variable shift left (6f), but x86 has BT instruction, so in a perfect code we would have 2 instructions less.
However, as we see simple access in the bitset is much more complicated than simple memory fetch from byteset. For short sets memory fetches are well cached and smaller number of instruction improves performance. For really large set cache misses would kill performance, then bitset is much better choice.
Lets look at this simple functions:
// set_test.cpp
#include <stdint.h>
#include <bitset>
const int size = 128;
typedef uint8_t byte_set[size];
bool any_in_byteset(uint8_t* data, size_t size, byte_set set) {
for (auto i=0u; i < size; i++)
if (set[data[i]])
return true;
return false;
}
typedef std::bitset<size> bit_set;
bool any_in_bitset(uint8_t* data, size_t size, bit_set set) {
for (auto i=0u; i < size; i++)
if (set[data[i]])
return true;
return false;
}
The file was compiled with g++ -std=c++11 -O3 set_test.cpp; Assembly code of the core of any_in_byteset:
28: 0f b6 10 movzbl (%eax),%edx 2b: 83 c0 01 add $0x1,%eax 2e: 80 3c 11 00 cmpb $0x0,(%ecx,%edx,1) 32: 75 0c jne 40 34: 39 d8 cmp %ebx,%eax 36: 75 f0 jne 28Statement if (set[data[i]]) return true are lines 28, 2e and 32, i.e.: load from memory, compare and jump. Instructions 2b, 34 and 36 handles the for loop.
Now, look at assembly code of any_in_bitset:
5f: 0f b6 13 movzbl (%ebx),%edx 62: b8 01 00 00 00 mov $0x1,%eax 67: 89 d1 mov %edx,%ecx 69: 83 e1 1f and $0x1f,%ecx 6c: c1 ea 05 shr $0x5,%edx 6f: d3 e0 shl %cl,%eax 71: 85 44 94 18 test %eax,0x18(%esp,%edx,4) 75: 75 39 jne b0
All these instructions implements the if statement! Again we have load from memory (5f), but checking which bit is set require much more work. Input (edx) is split to lower part --- i.e. bit number (67, 6c) and higher part --- i.e. word index (6c). Last step is to check if a bit is set in a word --- GCC used variable shift left (6f), but x86 has BT instruction, so in a perfect code we would have 2 instructions less.
However, as we see simple access in the bitset is much more complicated than simple memory fetch from byteset. For short sets memory fetches are well cached and smaller number of instruction improves performance. For really large set cache misses would kill performance, then bitset is much better choice.
Subskrybuj:
Posty (Atom)