Milo Yip has compared different itoa and dtoa implementations on a Core i7, including my itoa algorithm 2, which uses SSE2 instructions.
The results for itoa are interesting: the SSE2 version is not as good as it seemed to be. The tricky branchlut algorithm is only 10% slower and, moreover, is perfectly portable. One obvious drawback of this method is its use of a lookup table: in a real environment, where there is heavy pressure on the cache, memory access could become a bottleneck.
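To make the lookup-table idea concrete, here is a minimal sketch of a table-driven conversion that emits two decimal digits per step. It is not the branchlut code from the benchmark; the names DIGIT_PAIRS and u32toa_lut are made up for this illustration. It does show where the memory access comes from, though: every pair of digits costs one read from a 200-byte table.

```c
#include <stdint.h>
#include <string.h>

/* 200-byte table holding "00", "01", ..., "99" back to back.
   (Hypothetical name; the real branchlut table may be laid out differently.) */
static const char DIGIT_PAIRS[201] =
    "00010203040506070809"
    "10111213141516171819"
    "20212223242526272829"
    "30313233343536373839"
    "40414243444546474849"
    "50515253545556575859"
    "60616263646566676869"
    "70717273747576777879"
    "80818283848586878889"
    "90919293949596979899";

/* Writes the decimal form of `value` into `out` (at least 11 bytes)
   and returns the number of characters written, excluding the NUL. */
static size_t u32toa_lut(uint32_t value, char *out)
{
    char tmp[10];                 /* a uint32_t has at most 10 digits */
    char *p = tmp + sizeof tmp;   /* fill the buffer from the end */

    while (value >= 100) {
        const uint32_t pair = value % 100;
        value /= 100;
        p -= 2;
        memcpy(p, DIGIT_PAIRS + 2 * pair, 2);   /* one table read per two digits */
    }
    if (value >= 10) {
        p -= 2;
        memcpy(p, DIGIT_PAIRS + 2 * value, 2);
    } else {
        *--p = (char)('0' + value);
    }

    const size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

Calling u32toa_lut(100, buf) with a char buf[11] leaves "100" in buf and returns 3. The table occupies a few cache lines, so in a tight benchmark loop it stays hot; in a larger program it competes with everything else for cache space.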
Sunday, 30 November 2014
Saturday, 22 November 2014
Simple Testing Can Prevent Most Critical Failures
I recommend a very interesting paper. The authors studied many errors in complicated distributed systems, like Cassandra, and found that the majority of failures are caused by trivial errors (some of them could be detected even by unit tests). Here is a very interesting quote:
almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.
In my opinion, the causes of errors identified in the study may apply to any kind of software.
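As an illustration of what "incorrect handling of non-fatal errors explicitly signaled in software" can look like, here is a small C sketch. It is my own example, not one taken from the paper, and all names in it (append_record, store_ignoring_errors, store_checked) are hypothetical. The first variant receives the error code and does nothing with it; the second one propagates it.

```c
/* Hypothetical status codes, for illustration only. */
enum status { STATUS_OK = 0, STATUS_NO_SPACE = 1 };

/* Hypothetical operation that explicitly signals a non-fatal error. */
static enum status append_record(int free_slots)
{
    return free_slots > 0 ? STATUS_OK : STATUS_NO_SPACE;
}

/* Faulty variant: the error is detected, but the handler is empty,
   so the caller's bookkeeping drifts away from reality. */
static int store_ignoring_errors(int free_slots, int *stored)
{
    if (append_record(free_slots) != STATUS_OK) {
        /* TODO: handle this */
    }
    (*stored)++;    /* counted even though nothing was written */
    return 0;
}

/* Checked variant: the error is reported and the state stays consistent. */
static int store_checked(int free_slots, int *stored)
{
    if (append_record(free_slots) != STATUS_OK)
        return -1;  /* let the caller decide what to do */
    (*stored)++;
    return 0;
}
```

A unit test that simply calls the function with free_slots equal to 0 and checks the counter is enough to expose the first variant.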
Sunday, 16 November 2014
Speeding up searching in linked list
It sounds crazy, but it's possible in some cases. Here are the results of my experiments; 3 times faster isn't so bad.
list                : 0.780s, speedup 1.00
array list (4)      : 0.703s, speedup 1.11
array list (8)      : 0.515s, speedup 1.51
SIMD array list (4) : 0.365s, speedup 2.14
SIMD array list (8) : 0.258s, speedup 3.03
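The sketch below shows one way such a "SIMD array list (4)" could be built. It assumes each node stores up to four 32-bit values instead of one, so a single SSE2 comparison checks a whole node. The layout, the names struct node and list_contains, and the scalar fallback are my assumptions for this illustration, not necessarily the code behind the numbers above.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

#define CHUNK 4   /* values per node; matches one 128-bit register */

/* A list node that carries a small array instead of a single value. */
struct node {
    int32_t      values[CHUNK];
    unsigned     count;          /* number of used slots, 0..CHUNK */
    struct node *next;
};

/* Returns 1 if `key` is stored anywhere in the list, 0 otherwise. */
static int list_contains(const struct node *head, int32_t key)
{
#ifdef HAVE_SSE2
    const __m128i needle = _mm_set1_epi32(key);   /* key in all 4 lanes */
#endif
    for (const struct node *n = head; n != NULL; n = n->next) {
#ifdef HAVE_SSE2
        const __m128i v  = _mm_loadu_si128((const __m128i *)n->values);
        const __m128i eq = _mm_cmpeq_epi32(v, needle);
        /* One bit per 32-bit lane; drop the lanes beyond `count`. */
        const int lanes = _mm_movemask_ps(_mm_castsi128_ps(eq));
        if (lanes & ((1 << n->count) - 1))
            return 1;
#else
        /* Portable fallback: plain scan of the node's array. */
        for (unsigned i = 0; i < n->count; i++)
            if (n->values[i] == key)
                return 1;
#endif
    }
    return 0;
}
```

Even without SIMD, packing several values into a node means fewer pointer dereferences per element; the vector comparison then removes the inner loop as well.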
Friday, 14 November 2014
MSVC 2013 Release code
Today I had quite a long session with the debugger and release code, you know: no debug symbols and optimized code. I spotted two pieces of assembly that made me check whether the compilation mode was really set to release.
00000000000000D9  mov         byte ptr [rcx+1Fh],dl
00000000000000DC  mov         byte ptr [rdx+rcx],0
00000000000000E0  movzx       edx,byte ptr [rcx+1Fh]
00000000000000E4  movzx       eax,dl
First, dl is stored in memory. That's OK. Then edx is used as an offset. Also OK. It seems that the compiler knows the highest bits of edx are zero, i.e. edx = dl. Not so fast: edx is reloaded with its original value, and then eax is populated with the same value. These two movzx instructions could be replaced with a single mov eax, edx.
And another:
00000000000000BF  movaps      xmm6,xmmword ptr [foo]
00000000000000C4  movdqa      xmmword ptr [foo],xmm6
Yes, a load and a store of the same value. Completely useless! Moreover, xmm6 isn't used in the code that follows. (It's worth noting that the load is done by an FP unit and the store by an integer unit; inter-unit transfers cost one additional cycle on some processors.)
The above instruction sequences were produced by the newest compiler from Microsoft (MSVC 2013, Update 3).