Tuesday, 23 December 2014
Flipping Bits in Memory Without Accessing Them
A quote from the paper's abstract:
[...] as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other. In this paper, we expose the vulnerability of commodity DRAM chips to disturbance errors. By reading from the same address in DRAM, we show that it is possible to corrupt data in nearby addresses.
This is a must-read paper.
Sunday, 30 November 2014
Conversion int and double to string - comparison
Milo Yip has compared different itoa and dtoa implementations on a Core i7, including my itoa algorithm 2, which uses SSE2 instructions.
The results for itoa are interesting: the SSE2 version is not as good as it seemed to be. The tricky branchlut algorithm is only 10% slower and, moreover, is perfectly portable. One obvious drawback of that method is its lookup table: in a real environment with heavy cache pressure, memory accesses could become a bottleneck.
Labels: algorithms, assembler, floating point, SIMD, sse2
Saturday, 22 November 2014
Simple Testing Can Prevent Most Critical Failures
I recommend this very interesting paper. The authors studied many errors in complex distributed systems, such as Cassandra, and found that the majority of failures are caused by trivial errors (some of which could be detected even by unit tests). Here is a very interesting quote:
almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
In my opinion, the causes of the errors spotted in the study may apply to any kind of software.
Sunday, 16 November 2014
Speeding up searching in linked list
Sounds crazy, but it's possible in some cases. Here are the experiment results: 3 times faster isn't bad at all. A sketch of the underlying idea follows the numbers.
list                : 0.780s, speedup 1.00
array list (4)      : 0.703s, speedup 1.11
array list (8)      : 0.515s, speedup 1.51
SIMD array list (4) : 0.365s, speedup 2.14
SIMD array list (8) : 0.258s, speedup 3.03
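A sketch (my addition) of what I assume "array list (4)" above means: an unrolled linked list whose nodes keep a few keys in a small array, so a search follows far fewer pointers and the keys lie next to each other; in the SIMD variant the four keys of a node can be compared with a single packed comparison.

#include <cstdint>
#include <cstddef>

struct Node4 {
    uint32_t keys[4];   // values stored in this node
    uint32_t count;     // number of used slots
    Node4*   next;
};

bool contains(const Node4* head, uint32_t key) {
    for (const Node4* node = head; node != nullptr; node = node->next) {
        // a scalar scan of the small array; the SIMD variant replaces this
        // inner loop with one packed compare of all four keys at once
        for (uint32_t i = 0; i < node->count; i++)
            if (node->keys[i] == key)
                return true;
    }
    return false;
}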
Friday, 14 November 2014
MSVC 2013 Release code
Today I had quite a long session with the debugger and release code, you know: no debug symbols and optimized code. I spotted two pieces of assembly that forced me to check whether the compilation mode was really set to release.
00000000000000D9  mov   byte ptr [rcx+1Fh],dl
00000000000000DC  mov   byte ptr [rdx+rcx],0
00000000000000E0  movzx edx,byte ptr [rcx+1Fh]
00000000000000E4  movzx eax,dl
First, dl is stored in memory; that's fine. Then edx is used as an offset; also fine, since the compiler apparently knows that the highest bits of edx are zero, i.e. edx = dl. Not so fast: edx is reloaded with its original value, and then eax is populated with the same value. These two movzx instructions could be replaced with a single mov eax, edx.
And another:
00000000000000BF  movaps xmm6,xmmword ptr [foo]
00000000000000C4  movdqa xmmword ptr [foo],xmm6
Yes, a load and a store of the same value. Completely useless! Moreover, xmm6 isn't used in the following code. (It's worth noting that the load is done by an FP unit and the store by an integer unit; inter-unit transfers cost one additional cycle on some processors.)
The above instruction sequences were produced by the newest compiler from Microsoft (MSVC 2013, Update 3).
Friday, 26 September 2014
Fork bomb in Windows
Interpolation search revisited
Interpolation search revisited: the theoretical complexity of interpolation search is very promising, but...
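For reference, a textbook interpolation search (my addition, not the article's code): the probe position is interpolated from the key value instead of being taken from the middle, which gives roughly O(log log n) probes on uniformly distributed keys but degrades badly on skewed data.

#include <cstdint>
#include <cstddef>

// returns the index of key in the sorted array a[0..n), or -1 if absent
ptrdiff_t interpolation_search(const uint32_t* a, size_t n, uint32_t key) {
    size_t lo = 0, hi = n;                      // search window [lo, hi)
    while (lo < hi && key >= a[lo] && key <= a[hi - 1]) {
        if (a[hi - 1] == a[lo])                 // all remaining keys are equal
            return (a[lo] == key) ? (ptrdiff_t)lo : -1;

        // interpolate the expected position of the key within the window
        const size_t pos = lo + (size_t)((double)(key - a[lo])
                                / (a[hi - 1] - a[lo]) * (hi - 1 - lo));

        if (a[pos] == key)
            return (ptrdiff_t)pos;
        else if (a[pos] < key)
            lo = pos + 1;
        else
            hi = pos;
    }
    return -1;
}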
Sunday, 21 September 2014
Conversion number to hexadecimal representation
Conversion of numbers to hexadecimal representation: SWAR, plain SSE, and a draft of a BMI2 implementation.
The article SSSE3: printing hex values covers the same topic but is limited to exploiting PSHUFB.
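To give the flavour of the SWAR approach, here is a minimal sketch (my addition, not the article's code): the nibbles of a 32-bit value are spread into the bytes of a 64-bit word and the digit/letter correction is applied to all of them at once, without branches.

#include <cstdint>
#include <cstdio>
#include <cstring>

void u32_to_hex(uint32_t x, char out[8]) {
    // place each nibble of x in its own byte, most significant nibble first
    uint64_t nibbles = 0;
    for (int i = 0; i < 8; i++)
        nibbles |= uint64_t((x >> (28 - 4*i)) & 0xf) << (8*i);

    // a nibble greater than 9 produces a carry into bit 4 when 6 is added
    const uint64_t gt9 = ((nibbles + 0x0606060606060606ULL) >> 4)
                         & 0x0101010101010101ULL;

    // '0' is 0x30; letters additionally need 'a' - '0' - 10 = 0x27
    const uint64_t ascii = nibbles + 0x3030303030303030ULL + gt9 * 0x27;

    memcpy(out, &ascii, 8);   // assumes a little-endian machine
}

int main() {
    char buf[9] = {0};
    u32_to_hex(0xcafe1234, buf);
    printf("%s\n", buf);      // prints cafe1234
    return 0;
}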
Thursday, 18 September 2014
String literals are weird (at least in C/C++)
A simple quiz: what is the length of the string "\xbadcafe"? (A short demonstration of the rule involved follows the options.)
- 5 letters
- 4 letters
- 2 letters
- 1 letter
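For the curious, a small demonstration (my addition) of the rule that makes the quiz tricky: a hex escape greedily consumes every following hex digit, so splitting the literal is the only way to stop it.

#include <cstdio>
#include <cstring>

int main() {
    // "\x61" is a single hex escape: the letter 'a'
    printf("%zu\n", strlen("\x61"));        // prints 1

    // "\x617" would be parsed as ONE escape with value 0x617, which does not
    // fit in a char; typical compilers report "hex escape sequence out of range".
    // To get 'a' followed by '7', the literal has to be split:
    printf("%zu\n", strlen("\x61" "7"));    // prints 2
    return 0;
}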
Thursday, 11 September 2014
Python: rename file in Windows
Observation: the function os.rename under Windows doesn't allow overwriting an existing file. Python from Cygwin works properly.
Here is a workaround:
import os
import sys

def rename_file(old, new):
    # replace the file 'old' with the file 'new'
    if sys.platform != 'win32':
        os.rename(new, old)
    else:
        # Windows refuses to rename onto an existing file
        os.remove(old)
        os.rename(new, old)
Conversion numbers to binary representation
A new article, Conversion numbers to binary representation, shows SIMD and SWAR versions.
A few years ago I described an MMX variant of the SIMD algorithm, but that text was in Polish, so its audience was limited.
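For readers who don't know Polish, here is a minimal SWAR sketch (my addition, not necessarily the article's code): one multiplication broadcasts the byte, a mask keeps a different bit in every lane, and two more steps turn the lanes into ASCII '0'/'1' characters.

#include <cstdint>
#include <cstdio>
#include <cstring>

void byte_to_bin(uint8_t b, char out[8]) {
    // broadcast the byte and keep a different bit in each of the 8 lanes;
    // this lane order prints the most significant bit first (little-endian)
    const uint64_t spread = (b * 0x0101010101010101ULL) & 0x0102040810204080ULL;

    // set the MSB of every non-zero lane, then move it down to bit 0
    const uint64_t bits = ((spread + 0x7f7f7f7f7f7f7f7fULL)
                           & 0x8080808080808080ULL) >> 7;

    // add ASCII '0' in every lane
    const uint64_t ascii = bits + 0x3030303030303030ULL;
    memcpy(out, &ascii, 8);
}

int main() {
    char buf[9] = {0};
    byte_to_bin(0xa5, buf);
    printf("%s\n", buf);   // prints 10100101
    return 0;
}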
Saturday, 22 March 2014
C++ bitset vs array
C++ bitset conserves memory, but at the cost of access speed. A bitset must be slower than a set represented as a plain old array, at least when sets are small (say, a few hundred elements).
Let's look at these simple functions:
// set_test.cpp
#include <stdint.h>
#include <stddef.h>   // size_t
#include <bitset>

const int size = 128;

typedef uint8_t byte_set[size];

bool any_in_byteset(uint8_t* data, size_t size, byte_set set) {
    for (auto i=0u; i < size; i++)
        if (set[data[i]])
            return true;

    return false;
}

typedef std::bitset<size> bit_set;

bool any_in_bitset(uint8_t* data, size_t size, bit_set set) {
    for (auto i=0u; i < size; i++)
        if (set[data[i]])
            return true;

    return false;
}
The file was compiled with g++ -std=c++11 -O3 set_test.cpp. The assembly code of the core of any_in_byteset:
28: 0f b6 10       movzbl (%eax),%edx
2b: 83 c0 01       add    $0x1,%eax
2e: 80 3c 11 00    cmpb   $0x0,(%ecx,%edx,1)
32: 75 0c          jne    40
34: 39 d8          cmp    %ebx,%eax
36: 75 f0          jne    28

The statement if (set[data[i]]) return true corresponds to lines 28, 2e and 32, i.e. a load from memory, a compare and a jump. Instructions 2b, 34 and 36 handle the for loop.
Now, look at the assembly code of any_in_bitset:
5f: 0f b6 13          movzbl (%ebx),%edx
62: b8 01 00 00 00    mov    $0x1,%eax
67: 89 d1             mov    %edx,%ecx
69: 83 e1 1f          and    $0x1f,%ecx
6c: c1 ea 05          shr    $0x5,%edx
6f: d3 e0             shl    %cl,%eax
71: 85 44 94 18       test   %eax,0x18(%esp,%edx,4)
75: 75 39             jne    b0
All these instructions implement the if statement! Again we have a load from memory (5f), but checking which bit is set requires much more work. The input (edx) is split into the lower part, i.e. the bit number (67, 69), and the higher part, i.e. the word index (6c). The last step checks whether the bit is set in the selected word: GCC used a variable shift left (6f), but x86 has the BT instruction, so in perfect code we would have 2 instructions fewer.
However, as we can see, a simple access to the bitset is much more complicated than a plain memory fetch from the byteset. For short sets memory fetches are well cached and the smaller number of instructions improves performance. For really large sets cache misses would kill performance, and then the bitset is the much better choice.
Wednesday, 19 March 2014
Is const-correctness paranoia good?
Yes, definitely. Let's see this simple example:
$ cat test.cpp
int test(int x) {
    if (x = 1)
        return 42;
    else
        return 0;
}
$ g++ -c test.cpp
$ g++ -c -Wall test.cpp

The first compilation is silent; only when we turn on warnings does the compiler tell us about a possible error. Making the parameter const turns it into a hard error:
$ cat test2.cpp
int test(const int x) {
    if (x = 1)
        return 42;
    else
        return 0;
}
$ g++ -c test2.cpp
test2.cpp: In function ‘int test(int)’:
test2.cpp:2:8: error: assignment of read-only parameter ‘x’
     if (x = 1)
           ^

All input parameters should be const, and all write-once variables that serve as parameters for some computation should also be const.
Quick and dirty ad-hoc git hosting
Recently I needed to synchronize my local repository with a remote machine, just for a full backup. It's really simple if you have standard Linux tools (Cygwin works too, of course).
1. in a working directory run:
$ pwd
/home/foo/project
$ git update-server-info
2. in a parent directory start HTTP server:
$ cd ..
$ pwd
/home/foo
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...
3. on a remote machine clone/pull/whatever:
$ git clone http://your_ip:8000/project/.git

Step 1 has to be executed manually whenever the local repository changes.
Sunday, 16 March 2014
Scalar version of SSE move mask instruction
The SSE instruction PMOVMSKB gathers the most significant bit of every byte and stores them as a single 16-bit value; a similar action is performed by MOVMSKPD and MOVMSKPS.
Such an operation can easily be done using scalar multiplication. Read more...
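A sketch of the scalar idea (my addition, shown here only for a 4-byte word): mask the MSBs, then let a single multiplication gather them into the top bits.

#include <cstdint>
#include <cstdio>

uint32_t movemask_bytes(uint32_t x) {
    // keep the most significant bit of every byte: bits 7, 15, 23, 31
    const uint32_t msb = x & 0x80808080u;

    // the multiplier has bits 0, 7, 14 and 21 set; every partial product
    // lands on a distinct bit, so bits 28..31 collect the four MSBs
    return (msb * 0x00204081u) >> 28;
}

int main() {
    printf("%x\n", movemask_bytes(0x80801000u));   // prints c (bytes 3 and 2)
    return 0;
}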
Tuesday, 11 March 2014
C++ standard inaccuracy
First we read:

21.4.1 basic_string general requirements [string.require]
[...]
3 No erase() or pop_back() member function shall throw any exceptions.

... and a few pages later:

21.4.6.5 basic_string::erase [string::erase]
basic_string<charT,traits,Allocator>& erase(size_type pos = 0, size_type n = npos);
[...]
2 Throws: out_of_range if pos > size().

So erase() is required not to throw, yet its own Throws clause specifies out_of_range.
SIMD-friendly Rabin-Karp modification
The Rabin-Karp algorithm uses a weak hash function to locate possible substring positions.
This modification merely uses the equality of the first and last characters of the searched substring; however, comparing characters can be done very fast in parallel, even without SIMD instructions. Read more...
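A sketch of the idea with SSE2 intrinsics (my addition, not the article's exact code): broadcast the first and the last character of the needle, test 16 positions of the text at once, and run the full comparison only where both characters match. Needles shorter than two characters and the final tail of the text are left to a scalar fallback, omitted here.

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <emmintrin.h>

size_t find_first_last(const char* text, size_t n, const char* needle, size_t k) {
    const __m128i first = _mm_set1_epi8(needle[0]);
    const __m128i last  = _mm_set1_epi8(needle[k - 1]);

    for (size_t i = 0; i + k + 15 < n; i += 16) {
        const __m128i block_first = _mm_loadu_si128((const __m128i*)(text + i));
        const __m128i block_last  = _mm_loadu_si128((const __m128i*)(text + i + k - 1));

        const __m128i eq_first = _mm_cmpeq_epi8(first, block_first);
        const __m128i eq_last  = _mm_cmpeq_epi8(last,  block_last);

        uint32_t mask = (uint32_t)_mm_movemask_epi8(_mm_and_si128(eq_first, eq_last));
        while (mask != 0) {
            const int bit = __builtin_ctz(mask);        // candidate at i + bit
            if (memcmp(text + i + bit + 1, needle + 1, k - 2) == 0)
                return i + bit;
            mask &= mask - 1;                           // clear the lowest set bit
        }
    }
    return (size_t)-1;   // not found in the vectorized part
}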
Sunday, 9 March 2014
GCC - asm goto
Starting from GCC 4.5 the asm statement has a new form: asm goto. The programmer can jump to any label of the C/C++ program; however, output operands are not allowed in such a block.
Using asm block in old form required more instructions and additional storage:
uint32_t bit_set;
asm (
    "bt %2, %%eax      \n"
    "setc %%al         \n"
    "movzx %%al, %%eax \n"
    : "=a" (bit_set)
    : "a" (number), "r" (bit)
    : "cc"
);

if (bit_set)
    goto has_bit;

The above code is compiled to:
80483f6: 0f a3 d8    bt     %ebx,%eax
80483f9: 0f 92 c0    setb   %al
80483fc: 0f b6 c0    movzbl %al,%eax
80483ff: 85 c0       test   %eax,%eax
8048401: 74 16       je     8048419

With asm goto the same task can be written in a shorter and easier way:
asm goto (
    "bt %1, %0      \n"
    "jc %l[has_bit] \n"
    : /* no output */
    : "r" (number), "r" (bit)
    : "cc"
    : has_bit   // name of the label
);

A complete demo is available. See also the GCC documentation about Extended Asm.
Integer log 10 of unsigned integer - SIMD version
Fast calculation of ceil(log10(x)) of an unsigned number is described on Bit Twiddling Hacks; this text shows a SIMD solution for 32-bit numbers.
Algorithm:
1. Populate the value in XMM registers. Since the maximum value of this function is 10, we compare against nine powers of 10 and thus need three registers.
movd   %eax, %xmm0      // xmm0 = packed_dword(0, 0, 0, x)
pshufd $0, %xmm0, %xmm0 // xmm0 = packed_dword(x, x, x, x)
movapd %xmm0, %xmm1
movapd %xmm0, %xmm2

2. Compare these numbers with a sequence of powers of 10.
// powers_a = packed_dword(10^1 - 1, 10^2 - 1, 10^3 - 1, 10^4 - 1)
// powers_b = packed_dword(10^5 - 1, 10^6 - 1, 10^7 - 1, 10^8 - 1)
// powers_c = packed_dword(10^9 - 1, 0, 0, 0)
pcmpgtd powers_a, %xmm0
pcmpgtd powers_b, %xmm1
pcmpgtd powers_c, %xmm2

The results of the comparisons are 0 (false) or -1 (true), for example:
xmm0 = packed_dword(-1, -1, -1, -1)
xmm1 = packed_dword( 0,  0, -1, -1)
xmm2 = packed_dword( 0,  0,  0,  0)

3. Calculate the sum of all dwords.
psrld    $31, %xmm0    // xmm0 = packed_dword(1, 1, 1, 1) - convert -1 to 1
psubd    %xmm1, %xmm0  // xmm0 = packed_dword(1, 1, 2, 2)
psubd    %xmm2, %xmm0  // xmm0 = packed_dword(1, 1, 2, 2)

// convert packed_dword to packed_word
pxor     %xmm1, %xmm1
packssdw %xmm1, %xmm0  // xmm0 = packed_word(0, 0, 0, 0, 1, 1, 2, 2)

// max value of a word in xmm0 is 3, so the higher bytes are always zero
psadbw   %xmm1, %xmm0  // xmm0 = packed_qword(0, 6)

4. Save the result, i.e. the lowest dword.
movd %xmm0, %eax // eax = 6
Sample program is available.
Mask for zero/non-zero bytes
The description of Determine if a word has a zero byte from "Bit Twiddling Hacks" says this about haszero(v): "the result is the high bits set where the bytes in v were zero".
Unfortunately this is not exactly true. The high bits are also set for 0x01 bytes sitting above a zero byte, since the borrow propagates through them; e.g. haszero(0xff010100) = 0x00808080. Of course the result is still valid as a boolean (non-zero if there was any zero byte), but if we want to iterate over all zero bytes or find the index of the last zero, this could be a problem.
It's possible to create an exact mask:
#include <stdint.h>

uint32_t nonzeromask(const uint32_t v) {
    // the MSB of a byte gets set if any of its 7 lowest bits are set
    const uint32_t nonzero_7bit = (v & 0x7f7f7f7f) + 0x7f7f7f7f;

    // OR with v itself catches bytes whose only set bit is the MSB
    return (v | nonzero_7bit) & 0x80808080;
}

uint32_t zeromask(const uint32_t v) {
    // negate the MSBs
    return nonzeromask(v) ^ 0x80808080;
}
The function nonzeromask requires 4 simple instructions, and zeromask needs one additional xor.
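A quick check (my addition) that can be appended to the listing above; it prints the exact masks for the example value discussed earlier.

#include <cstdint>
#include <cstdio>

uint32_t nonzeromask(uint32_t v);   // from the listing above
uint32_t zeromask(uint32_t v);      // from the listing above

int main() {
    printf("%08x\n", (unsigned) nonzeromask(0xff010100));   // prints 80808000
    printf("%08x\n", (unsigned) zeromask(0xff010100));      // prints 00000080
    return 0;
}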
Monday, 3 March 2014
Slow-paths in GNU libc strstr
I've observed that some patterns passed to strstr cause a significant slowdown.
The sample program kill-strstr.c executes strstr(data, pattern), where data is a large string (16 MB) filled with the character '?'; patterns are read from the command line. (A minimal sketch of such a test appears after the observations below.)
On my machine the following times were recorded:
1. searching string 'johndoe'... time: 0.032
2. searching string '??????????????????a'... time: 0.050
3. searching string '??????????????????????????????a'... time: 0.049
4. searching string '???????????????????????????????a'... time: 0.274
5. searching string '??????????????????????????????a?'... time: 0.356
6. searching string '??????????????????????????????a??????????????????????????????'... time: 0.396
- A slowdown is visible in case 4 (5 times slower than pattern 3). The pattern has 32 characters and consists only of '?', except for the last character.
- An even bigger slowdown occurs in case 5 (7 times slower). This pattern also contains 32 characters, but the single letter 'a' sits in the last-but-one position.
- A similar slowdown occurs in case 6 (nearly 8 times slower). In this pattern the single letter 'a' is surrounded by 30 '?' characters on each side.
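A minimal sketch of such a test (my addition; not the original kill-strstr.c): fill a 16 MB buffer with '?' and time strstr for each pattern given on the command line.

#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <ctime>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s pattern...\n", argv[0]);
        return 1;
    }

    const size_t n = 16 * 1024 * 1024;
    char* data = (char*)malloc(n + 1);
    memset(data, '?', n);        // haystack without any match
    data[n] = '\0';

    for (int i = 1; i < argc; i++) {
        const clock_t t0 = clock();
        const char* p = strstr(data, argv[i]);
        const double dt = double(clock() - t0) / CLOCKS_PER_SEC;
        printf("searching string '%s'... time: %.3f (%s)\n",
               argv[i], dt, p ? "found" : "not found");
    }

    free(data);
    return 0;
}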
Sunday, 26 January 2014
Penalties of errors in SSE floating point calculations
SSE provides a not widely known control register called MXCSR. This register plays three roles (a short example of reading and modifying it follows the list):
- controls calculations:
  - the "flush to zero" flag (described later)
  - the "denormals are zeros" flag (described later)
  - the rounding mode (not covered in this text)
- allows masking/unmasking floating-point exceptions
- saves information about floating-point errors; these flags are sticky, i.e. the programmer is responsible for clearing them
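A minimal sketch (my addition) showing how MXCSR can be read and modified with the _mm_getcsr / _mm_setcsr intrinsics; the bit numbers below are the ones documented for MXCSR.

#include <cstdio>
#include <xmmintrin.h>   // _mm_getcsr, _mm_setcsr

int main() {
    unsigned int mxcsr = _mm_getcsr();
    printf("MXCSR = 0x%08x\n", mxcsr);

    // bit 15: flush to zero (FTZ), bit 6: denormals are zeros (DAZ)
    _mm_setcsr(mxcsr | (1u << 15) | (1u << 6));

    // the sticky exception flags live in bits 0..5 (IE, DE, ZE, OE, UE, PE);
    // the program has to clear them itself, e.g. with _mm_setcsr
    printf("sticky flags = 0x%02x\n", _mm_getcsr() & 0x3f);
    return 0;
}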
Wednesday, 1 January 2014
x86 - ISA where 80% of instructions are unimportant
A few years ago I counted instructions in many Linux binaries; 2014 is a good year to repeat this experiment and see that nothing has changed.
I use 32-bit Debian; my installation was updated a few months ago. All files from /usr/bin and all *.so files from /usr/lib were disassembled with objdump (5050 files were processed). Instructions were grouped simply by mnemonic name; taking into account all addressing and encoding modes would be overkill. I've published the script that does the job.
Short summary
- The number of distinct decoded instructions is around 650. Since objdump uses AT&T syntax, the same opcode is seen under different mnemonics; for example mov appears as movw, movb or movl depending on the argument size.
- The total number of x86 instructions is around 750. Read: one hundred instructions never appeared in the binaries.
- There are 81 instructions used just once, for example the quite useful CMPPD.
- There are 22 instructions used twice, for example MFENCE; is no one aware of memory ordering?
- There are 15 instructions used three times, for example BTC; apparently bit-manipulating operations are useless.
- 81 plus 22 plus 15 is 118. Another hundred or so of useless stuff.
- The total count of the 15 most frequently used instructions is 87.84% of all instructions (almost all, isn't it?).
- The most frequently used instruction is data transfer (mov/movl) --- 42%
- Control flow instructions (call/ret/jmp) --- 13%.
- Conditions (cmp/test/condition jumps: je/jne) --- 10%.
- Basic arithmetic (add/sub/lea) --- 12%
- Simple stack operations (push/pop) --- 6%
A very interesting observation is that conditions are mostly based on je/jne, i.e. jump if zero / jump if not zero.
The first FPU instruction appears at the 28th position. The first integer SSE instruction appears at the 167th position. The first SSE instruction operating on packed floats appears at the 315th position.
Detailed results
The whole table is available as a txt file.

instruction | count | % |
---|---|---|
mov | 5934098 | 37.63% |
call | 1414355 | 8.97% |
lea | 1071501 | 6.79% |
movl | 760677 | 4.82% |
push | 655921 | 4.16% |
jmp | 611540 | 3.88% |
add | 560517 | 3.55% |
je | 490250 | 3.11% |
test | 475899 | 3.02% |
pop | 441608 | 2.80% |
sub | 366228 | 2.32% |
cmp | 326379 | 2.07% |
jne | 264110 | 1.67% |
nop | 242356 | 1.54% |
ret | 238569 | 1.51% |
xor | 148194 | 0.94% |
movzbl | 122730 | 0.78% |
and | 88863 | 0.56% |
xchg | 66885 | 0.42% |
cmpl | 64907 | 0.41% |
movzwl | 64589 | 0.41% |
movb | 57247 | 0.36% |
or | 52138 | 0.33% |
shl | 50908 | 0.32% |
cmpb | 50152 | 0.32% |
jle | 41083 | 0.26% |
leave | 39923 | 0.25% |
fldl | 37428 | 0.24% |
fstpl | 37368 | 0.24% |
shr | 36503 | 0.23% |
jbe | 32866 | 0.21% |
ja | 32333 | 0.21% |
sar | 30917 | 0.20% |
flds | 29672 | 0.19% |
subl | 27636 | 0.18% |
setne | 27626 | 0.18% |
testb | 27420 | 0.17% |
addl | 25906 | 0.16% |
imul | 25569 | 0.16% |
jg | 24796 | 0.16% |
fstp | 24349 | 0.15% |
fxch | 23464 | 0.15% |
js | 21550 | 0.14% |
fstps | 21248 | 0.13% |
sbb | 16607 | 0.11% |
inc | 16200 | 0.10% |
lock | 16049 | 0.10% |
jae | 14825 | 0.09% |
sahf | 14765 | 0.09% |
dec | 14276 | 0.09% |
fnstsw | 14026 | 0.09% |
sete | 13902 | 0.09% |
movw | 13895 | 0.09% |
adc | 13640 | 0.09% |
jb | 12467 | 0.08% |
jl | 11700 | 0.07% |
repz | 11178 | 0.07% |
fldcw | 11110 | 0.07% |
jge | 11019 | 0.07% |
movswl | 10816 | 0.07% |
fildl | 8852 | 0.06% |
cmpw | 7601 | 0.05% |
jns | 7490 | 0.05% |
fldz | 7331 | 0.05% |
fmul | 7229 | 0.05% |
out | 7203 | 0.05% |
not | 7028 | 0.04% |
movsbl | 6720 | 0.04% |
in | 6503 | 0.04% |
fld | 6309 | 0.04% |
faddp | 6254 | 0.04% |
fstl | 5760 | 0.04% |
fucom | 5753 | 0.04% |
neg | 5725 | 0.04% |
fucompp | 5354 | 0.03% |
rep | 5059 | 0.03% |
fmuls | 5039 | 0.03% |
pushl | 4430 | 0.03% |
jp | 4424 | 0.03% |
fnstcw | 4400 | 0.03% |
fld1 | 4176 | 0.03% |
fmulp | 4133 | 0.03% |
orl | 3927 | 0.02% |
fadds | 3789 | 0.02% |
movq | 3779 | 0.02% |
fistpl | 3709 | 0.02% |
cltd | 3597 | 0.02% |
fmull | 3313 | 0.02% |
stos | 3298 | 0.02% |
lret | 3183 | 0.02% |
scas | 3103 | 0.02% |
lods | 3066 | 0.02% |
cwtl | 3064 | 0.02% |
fadd | 2852 | 0.02% |
fucomp | 2678 | 0.02% |
orb | 2481 | 0.02% |
fildll | 2418 | 0.02% |
andl | 2379 | 0.02% |
setb | 2337 | 0.01% |
andb | 2263 | 0.01% |
552 rows more... |