tag:blogger.com,1999:blog-49004865888625770972024-03-14T07:27:05.574+01:00Notes on computers, programming and all the stuffWojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.comBlogger123125tag:blogger.com,1999:blog-4900486588862577097.post-59862448510734297972018-12-10T22:46:00.001+01:002018-12-10T22:46:38.970+01:00This blog is obsoleteI'm available on <a href="https://twitter.com/pshufb" target="_blank">Twitter</a>. The up-to-date list of my articles or short notes is available on my homepage <a href="http://0x80.pl/" target="_blank">0x80.pl</a>. <br />
<br />
I used to publish here announcements or short notes. For various reasons it didn't work well. This blog probably won't be updated any more.Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-55489384382934397042016-05-01T10:19:00.002+02:002016-05-01T10:19:17.408+02:00GCC: and inlining failed in call to always_inline 'FOO': target specific option mismatchAVX512 comes with the number of variants, and a compiler must know which AVX512 version it compiles.<br />
<br />
GCC error<b> inlining failed in call to always_inline 'FOO': target specific option mismatch</b> occurs when a program containing some SIMD-intrinsics, and compiler has wrong or missing target options. The target option are introduced by "-m".<br />
<br />
Lets look at the error from real world:<br />
<br />
<pre>/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512vlbwintrin.h:790:1: error: inlining failed in call to always_inline ‘_mm_movm_epi8’: target specific option mismatch
</pre>
<br />
Now, when open avx512vlbwintrin.h, we see at the beginning of file:<br />
<br />
<pre>...
#pragma GCC push_options
#pragma GCC target("avx512vl,avx512bw")
#define __DISABLE_AVX512VLBW__
...
</pre>
<br />
Thus, in order to properly compile the program, gcc have to be feed with the two options listed at the target line: <b>-mavx512vl</b> and <b>-mavx512bw</b>.Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-45796489474569017912016-02-13T10:43:00.000+01:002016-02-13T10:43:33.956+01:00bash: $0 valueWhen a script is run from a command line then the 0th parameter is the script's name. However, when a script is run via <b>source</b> command, then the 0th parameter is a shell name. Weird, but true.<br />
<br />
<br />
<pre>$ cat test.sh
echo "\$0 is $0"
$ bash test.sh
$0 is test.sh
$ source ./test.sh
$0 is bash
</pre>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com3tag:blogger.com,1999:blog-4900486588862577097.post-31264869548266942302016-01-17T10:42:00.001+01:002016-01-17T10:42:06.917+01:00Base64 encoding with SIMD instructionsBase64 decoding could also be vectorized, although the speedup is not very impressive, merely 35%. <a href="http://0x80.pl/notesen/2016-01-17-sse-base64-decoding.html" target="_blank"><span class="tco-ellipsis"></span>Read more ...</a><span class="invisible"></span><span class="tco-ellipsis"></span>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-82522406559973339542016-01-12T16:15:00.000+01:002016-01-12T16:15:01.366+01:00Base64 encoding with SIMD instructionsAn SSE code is more
than <strong>2 times</strong> faster on Core i7, and around <strong>70%</strong> faster on Core i5. <a href="http://0x80.pl/notesen/2016-01-12-sse-base64-encoding.html" target="_blank">Read more...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-172860141477756332015-12-29T21:00:00.001+01:002015-12-29T21:00:25.684+01:00Fast conversion of floating-point values to stringThe conversion to string could be 15 times faster than <tt>sprintf</tt>. <a href="http://0x80.pl/notesen/2015-12-29-float-to-string.html" target="_blank">Read more...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-76927895724457292772015-12-27T19:27:00.000+01:002015-12-27T19:27:48.464+01:00Base64 encoding — implementation studyAlthough base64 encoding is a very basic algorithm, it could be sped up a little (25% sounds good?) <a href="http://0x80.pl/notesen/2015-12-27-base64-encoding.html" target="_blank">Read more...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-81891434622203228932015-12-26T22:24:00.000+01:002015-12-27T09:30:59.562+01:00Benefits from the obsessionEverything has started few years ago when I found <a class="reference external" href="http://blog.regehr.org/">John Regher's blog</a>. If
you don't know the blog I highly recommend it. Among other things (I like
the photos!) the author studies bugs in compilers, undefined behaviours and
similar things. The word "overflow" appears quite often in his posts due to
the great number of errors related to improper use of the integer
arithmetic. Well, I don't know when the obsession has exactly started, but
recently I realized that I am alert of all integer operations in my
programs. <a href="http://0x80.pl/notesen/2015-12-13-obsession.html" target="_blank">Read more...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-1779584361878248802015-11-28T14:02:00.001+01:002015-11-28T14:02:36.236+01:00Implicit conversion - the enemyI wrote:<br />
<br />
<pre> result += string_utils::pad_left(string, '0');
</pre>
<br />
I forget that <tt>pad_left</tt> signature is <tt>string, int, char</tt> and the char parameter has a default value. My mistake, without doubts.<br />
<br />
This is another example of dark sides of the implicit conversions. C++ converts between characters and integers seamlessly. These two beast are distinct in the nature. Of course characters are <b>represented</b> by the numbers, however it's an implementation detail.<br />
<br />
One can say: you made a mistake and now blame the language. No, I blame language's design. I'm afraid that we end up with something like <tt>Integer</tt> and <tt>int</tt> to overcome such problems.<br />
<br />
Lesson learned: never use default parameters in public API (surprise!)Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-91400008367986796312015-11-22T15:34:00.000+01:002015-11-23T19:15:10.009+01:00Another C++ nasty featureI'm fond of C++ weirdness, really. This language is full of traps, and it shocks me once in a while.<br />
<br />
Let's look at this piece of code, a part of a larger module:<br />
<br />
<pre>void validate_date() {
// ...
boost::optional<unsigned> clock_hour;
boost::optional<unsigned> am_pm_clock;
// ... fill these fields
if (<i>some sanity check failed</i>) {
report_error("user has entered wrong time: %d %s",
*clock_hour
*am_pm_clock ? "AM" : "PM");
}
}
</pre>
<br />
We would expect that in case of an error following line will be reported: "user has entered wrong time: 123 PM". Obvious. But please look closer at the code, do you see any mistake? There is one... dirty... hard to notice. I'll give you a minute.<br />
<br />
So, the mistake is lack of comma between expressions <b>*clock_hour</b> and <b>*am_pm_clock</b>. However, the code is valid! It compiles! And it took me a little longer than a minute to understand what happened. Explanation is:<br />
<ul>
<li><tt>*clock_hour</tt> evaluates to expression of type <tt>unsigned</tt>;</li>
<li>then compiler sees <tt>*</tt> - a multiplication operator;</li>
<li>so checks if multiplication of <tt>unsigned</tt> (on the left side) with <tt>boost::optional<unsigned></tt> (on the right side) is possible;</li>
<li>it is, because <tt>boost::optional<T></tt> has conversion operator to type <tt>T</tt>.</li>
</ul>
We can rewrite the whole expression, now it should be clear:<br />
<br />
<pre> ((*clock_hour) * unsigned(am_pm_clock)) ? "AM" : "PM"
</pre>
<br />
In result method is called with <b>a single parameter</b> of type <tt>cont char*</tt>.<br />
<br />
It's bizarre, it's terrible. A language should help a programmer. In my opinion implicit conversions is the worst feature of C++.
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com3tag:blogger.com,1999:blog-4900486588862577097.post-39531197679202314752015-11-15T16:48:00.000+01:002015-12-26T22:22:14.323+01:00Short report from code::dive 2015Few days ago I attended <b>code::dive 2015</b>, an IT conference in Wrocław, Poland. It was a one-day conference with a great number of presentations. There were four stages and five sessions, in total 20 talks. Impressive number! But an attender had to choose his own path of just five lectures. I think the decision was pretty difficult. Sometimes less is better. <a href="http://0x80.pl/notesen/2015-11-15-code-dive.html" target="_blank">Read more</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-82016386385536916522015-07-15T20:15:00.003+02:002015-07-15T20:15:48.176+02:00C++ magickA programmer wrote:<br />
<br />
<pre>class container;
class IndexOutOfBounds {
public:
IndexOutOfBounds(const std::string& msg);
};
void container::remove(int index) {
if (index < 0 || index >= size()) {
throw new IndexOutOfBounds("Invalid index: " + index);
}
// the rest of method
}
</pre>
<br />
Do you see the mistake? Programmer assumed that expression <tt>"Invalid index: " + index</tt> evaluates to <tt>std::string("Invalid index: <some number>")</tt>.<br />
<br />
In fact type of expression <tt>"Invalid index: "</tt> is <tt>char[15]</tt>, so <tt>char[15] + integer</tt> results in --- more or less --- <tt>char*</tt>. For <tt>index</tt> in range [0, 15] exception will carry tail of the message; for example when <tt>index=10</tt> then it will be <tt>"dex: "</tt>. But for indexes larger than 15 and less than 0 <b>program likely crash</b>.<br />
<br />
This is why I hate C++, the language has many dark corners, stupid conventions, implicit conversion, not to mention UB ("just" 150 UB, if you're curious).Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-72784359421071674242015-06-20T08:37:00.001+02:002015-06-20T08:37:26.360+02:00Implementation of BT-trees<a href="http://arxiv.org/abs/1505.01210" target="_blank">Great paper</a> by <b>Lars F. Bonnichsen</b>, <b>Christian W. Probst</b>, <b>Sven Karlsson</b>:
<br />
<blockquote>
This document presents the full implementation details of BT-trees, a <b>highly efficient ordered map</b>, and an evaluation which compares BT-trees with unordered maps. BT- trees are often much faster than other ordered maps, and have comparable performance to unordered map implementations. However, in benchmarks which favor unordered maps, BT-trees are not faster than the fastest unordered map implementations we know of.
</blockquote>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-39747946045136645882015-06-20T08:06:00.000+02:002015-12-27T09:29:57.260+01:00Boolean function for the rescueThe problem is defined as follows: a set of features is saved using bit-sets (usually large), and there is a list/map/whatever of sets containing features of different objects. We have to find which features are unique. <a href="http://0x80.pl/notesen/2015-10-25-boolean-functions.html" target="_blank">Read more...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-41027174740316301642015-06-10T08:53:00.001+02:002015-06-10T08:53:31.533+02:00Big progress in verificationFormal verification is not easy task, for example <a href="http://compcert.inria.fr/" target="_blank">ComCert</a> compiler is able to verify, that optimizations haven't modified semantic of a program. Paper <a href="http://www.cs.princeton.edu/~appel/papers/verified-hmac.pdf?v=0" target="_blank">Verified correctness and security of OpenSSL HMAC </a>describes verification of the whole "stack":<br />
<blockquote>
We have proved, with machine-checked proofs in Coq, that an OpenSSL implementation of HMAC with SHA-256 correctly implements its FIPS functional specification and that its functional specification guarantees the expected cryptographic properties. This is the first machine-checked cryptographic proof that combines a source-program implementation proof, a compiler-correctness proof, and a cryptographic-security proof, with no gaps at the specification interfaces.
</blockquote>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-36319151561111792772015-05-22T12:58:00.001+02:002015-05-22T12:58:10.280+02:00Fast exact summation using small and large superaccumulators<a href="http://arxiv.org/abs/1505.05571" target="_blank">Interesting article</a> by <b>Radford M. Neal</b>:
<br />
<blockquote>
I present two new methods for exactly summing a set of floating-point
numbers, and then correctly rounding to the nearest floating-point number.
Higher accuracy than simple summation (rounding after each addition) is
important in many applications, such as finding the sample mean of data.
</blockquote>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-90081217251516330692015-05-20T10:10:00.003+02:002015-05-20T10:10:35.041+02:00Optimizing Dijkstra for real-world performance<a href="http://arxiv.org/abs/1505.05033" target="_blank">Another interesting paper</a>:<br />
<br />
<blockquote>
Our experimental results currently put our prototype implementation at <b>about
twice as fast</b> as the Boost implementation of the algorithm on both real-world
and generated large graphs. Furthermore, this preliminary implementation was
written in only a few weeks, by a single programmer. The fact that such an
early prototype compares favorably against Boost, a well-known open source
library developed by expert programmers, gives us reason to believe our design
for the queue is indeed better suited to the problem at hand, and the favorable
time measurements are not a product of any specific implementation technique we
employed.
</blockquote>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-57579960852602040502015-04-21T07:49:00.000+02:002015-04-21T07:49:04.985+02:00The Influence of Malloc Placement on TSX Hardware Transactional Memory<a href="http://arxiv.org/abs/1504.04640" target="_blank">Interesting paper</a>:<br />
<br />
<blockquote>
We show that the placement
policies of dynamic storage allocators -- such as those found in common
"malloc" implementations -- can influence the L1 conflict miss rate in the L1.
Conflict misses -- sometimes called mapping misses -- arise because of less
than ideal associativity and represent imbalanced distribution of active memory
blocks over the set of available L1 indices. Under transactional execution
conflict misses may manifest as aborts, representing wasted or futile effort
instead of a simple stall as would occur in normal execution mode.
</blockquote>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-54200510439561386502015-04-19T17:44:00.002+02:002015-04-19T17:44:18.855+02:00Conversion numbers to binary ASCII representation - new methodRecently I've checked different methods to <a href="http://0x80.pl/articles/convert-to-bin.html" target="_blank">convert numbers to binary representation</a>, including use of new <b>PDEP</b> instruction from <a href="http://en.wikipedia.org/wiki/Bit_Manipulation_Instruction_Sets" target="_blank">BMI2 extension</a>.<br />
<br />
Today I've updated the article with new SWAR version 2, a tricky use of multiplication. The method is not faster, but I like the approach---in certain conditions multiplication can be seen as multi-shift/bit-or instruction. I've already use multiplication in this way to <a href="http://0x80.pl/articles/scalar-sse-movmask.html" target="_blank">emulate instruction <b>pmovmskb</b></a>.Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-24787486103978474192015-04-13T20:59:00.000+02:002015-04-13T20:59:33.111+02:00 Speeding up bit-parallel population count<a href="http://0x80.pl/articles/faster-popcount-for-large-data.html" target="_blank">Nearly 50% faster than naive version</a> for large data sets. Discovered by accident. :)Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-23303077794043512872015-04-09T21:09:00.000+02:002015-04-09T21:09:22.421+02:00Github repositoriesI've put source code for my two articles at github:<br />
<ul>
<li>
<a href="http://0x80.pl/articles/sse-popcount.html" target="_blank">SSSE3: fast popcount</a> --- <a href="https://github.com/WojciechMula/sse-popcount" target="_blank">repository</a></li>
<li><a href="http://0x80.pl/articles/sse4_substring_locate.html" target="_blank">SSE4 string search - modification of Karp-Rabin algorithm </a> --- <a href="https://github.com/WojciechMula/sse4-strstr" target="_blank">repository</a></li>
</ul>
Repositories contain original code, read: C99, 32-bit for GCC with inline assembly and also new programs in C++11 using intrinsics, tested in 64-bit environment.<br />
<br />
<br />
BTW the article about popcount has gained popularity, and I hope another crazy idea about hacking <b><tt>MPSADBW</tt></b> will spread all over the world.<br />
<ul>
</ul>
Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-58024528811305990212015-04-09T20:48:00.004+02:002015-04-09T20:48:49.894+02:00SIMD-ized searching in unique constant dictionaryThe problem: there is a <strong>ordered dictionary</strong> containing only
<strong>unique</strong> keys. Dictionary is read only, and keys are 32-bit (SSE) or
64-bit (AVX2). <a href="http://0x80.pl/articles/simd-search.html" target="_blank">Read more</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-48809572162664352322015-03-22T19:32:00.000+01:002016-01-17T10:43:46.640+01:00SIMD: detecting a bit patternThe problem: there are 64-bit values with some data bits and some metadata bits; metadata includes a k-bit field describing a "type" (k >= 0). Type field is located in a lower 32-bits.<br />
<br />
Procedure processes two "types", one denoted with code 3 and another with 5. When all items are of type 3 then we can use a fast AVX2 path, if there are some types 5, we have to call an additional function (a virtual method, to be precise). <a href="http://0x80.pl/articles/simd-pattern.html" target="_blank">Read more ...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com1tag:blogger.com,1999:blog-4900486588862577097.post-36416834646773728722015-03-22T11:04:00.002+01:002016-01-17T10:45:03.877+01:00Compiler warnings are your future errorsMonths ago I was asked to upgrade GCC from version 4.7 to 4.9 and also cleanup configure scripts. Not very exciting, merely time consuming task. Oh, we were hit by a <a href="https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62258" target="_blank">bug in libstdc++</a>, but simple patch have fixed the problem. Few weeks later I was asked to change GCC switch from <tt>-std=c++11</tt> to <tt>-std=c++14</tt> -- the easiest task in the world. I had to modify single script, run <tt>configure</tt>, type <tt>make</tt>, then run tests... everything was OK. Quite boring so far.<a href="http://0x80.pl/notesen/2015-03-22-compiler-warnings.html" target="_blank"> Read more ...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0tag:blogger.com,1999:blog-4900486588862577097.post-55857569245892704832015-03-22T10:02:00.000+01:002016-01-17T10:46:05.234+01:00AVX512: ternary functions evaluationIntel's version of SIMD offers following 2-argument (<i>binary</i>) boolean functions: <b>and</b>, <b>or</b>, <b>xor</b>, <b>and not</b>. There isn't a single argument <b>not</b>, this function can be expressed with <tt>xor reg, ones</tt>, however this require additional, pre-set register.<br />
<br />
<b>AVX512F</b> will come with very interesting instruction called <tt>vpternlog</tt>. <a href="http://0x80.pl/articles/avx512-ternary-functions.html" target="_blank">Read more ...</a>Wojciech Mułahttp://www.blogger.com/profile/08330413422626283672noreply@blogger.com0