niedziela, 10 października 2010
PIC Language
Yesterday I made my first diagram with PIC (see Wikipedia) and... I felt in love. Language is just perfect, intuitive and powerful. Try it and you will never use any popular drawing software.
wtorek, 28 września 2010
Xorg VESA driver - higher resolutions
Default configuration of VESA driver allows resolutions 640x480 and 800x600.
It is quite simple to get higher resolutions - in section Monitor of /etc/X11/xorg.conf we have to set horizontal and vertical refresh rates. By default these values are set to lowest, safe area.
Here are my settings:
X.org is able to work at 1280x1024 with vertical refresh rate 75Hz. HTH
It is quite simple to get higher resolutions - in section Monitor of /etc/X11/xorg.conf we have to set horizontal and vertical refresh rates. By default these values are set to lowest, safe area.
Here are my settings:
Section "Monitor" Identifier "Monitor0" VendorName "Monitor Vendor" ModelName "Monitor Model" HorizSync 42.0 - 80.0 VertRefresh 55.0 - 100.0 EndSection
X.org is able to work at 1280x1024 with vertical refresh rate 75Hz. HTH
wtorek, 24 sierpnia 2010
PostgrSQL: printf in PL/pgSQL
PostgreSQL wiki has entry about sprintf - is is quite simple approach (and isn't marked as immutable), the main drawback is iterating over all chars of format string. Here is a version use strpos to locate % in format string, and it's faster around 2 times:
CREATE OR REPLACE FUNCTION printf2(fmt text, variadic args anyarray) RETURNS text LANGUAGE plpgsql IMMUTABLE AS $$ DECLARE argcnt int := 1; head text := ''; -- result tail text := fmt; -- unprocessed part k int; BEGIN LOOP k := strpos(tail, '%'); IF k = 0 THEN -- no more '%' head := head || tail; EXIT; ELSE IF substring(tail, k+1, 1) = '%' THEN -- escape sequence '%%' head := head || substring(tail, 1, k); tail := substring(tail, k+2); ELSE -- insert argument head := head || substring(tail, 1, k-1) || COALESCE(args[argcnt]::text, ''); tail := substring(tail, k+1); argcnt := argcnt + 1; END IF; END IF; END LOOP; RETURN head; END; $$;
sobota, 17 lipca 2010
SSSE3 population count vs hardware
Peter Kankowski compared speed of SSE4.2 instructions crc32 and popcnt against software implementations. Hardware CRC32 is significantly faster, but population count is slightly slower than my SSSE3 popcount!
niedziela, 9 maja 2010
Branchless set mask if value greater or how to print hex values
Suppose we need to get mask when nonnegative argument is greater then some constant value; in other words, we want to evaluate following expression:
Portable branchless solution:
The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because carry bit propagate through series of ones to k-th position of result.
Real world example - branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).
It is also possible to convert 4 hex digits in parallel using similar algorithm, but input data have to be correctly prepared. Moreover generating mask requires 3 instructon and one extra register (in scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for SIMD code, where similar code transforms more bytes at once.
See also: SSSE3: printing hex values (weird use of PSHUFB instruction)
if x > const_n then mask := 0xffffffff; else mask := 0x00000000;
Portable branchless solution:
- choose magic number M := (1 << (k-1)) - 1 - n, where k is a bit position, for example 31 if we operate on 32-bit words
- calculate R := x + M
- k-th bit of R is set if x > n
- fill mask with this bit - see note Fill word with selected bit
The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because carry bit propagate through series of ones to k-th position of result.
Real world example - branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).
; input: eax - hex digit ; output: eax - ASCII letter (0-9, A-F or a-f) ; destroys: ebx andl 0xf, %eax leal 0x7ffffff6(%eax), %ebx ; MSB(ebx)=1 when eax >= 10 sarl $31, %ebx ; ebx - mask andl $7, %ebx ; ebx = 7 when eax >= 10 (for A-F letters) ;andl $39, %ebx ; ebx = 39 when eax >= 10 (for a-f letters) leal '0'(%eax, %ebx), %eax ; eax = '0' + eax + ebx => ASCII letter
It is also possible to convert 4 hex digits in parallel using similar algorithm, but input data have to be correctly prepared. Moreover generating mask requires 3 instructon and one extra register (in scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for SIMD code, where similar code transforms more bytes at once.
; input: eax - four hex digits in form [0a0b0c0d] ; output: eax - four ascii letters ; destroys: ebx, ecx leal 0x76767676(%eax), %ebx ; MSB of each byte is set when corresponding eax byte is >= 10 ; (here: 0x7f - 9 = 0x76) andl $0x80808080, %ebx movl %ebx, %ecx shrl $7, %ebx subl %ebx, %ecx ; ecx - byte-wise mask ;andl $0x07070707, %ecx ; for ASCII letters A-F andl $0x27272727, %ecx ; for ASCII letters a-f leal 0x30303030(%eax, %ecx), %eax ; ecx - four ascii letters
See also: SSSE3: printing hex values (weird use of PSHUFB instruction)
sobota, 1 maja 2010
Speedup reversing table of bytes
With help of BSWAP instruction or SSE instructions (PSHUFD, PSHUFLW, PSHUFHW) or SSSE3 instruction (PSHUFB) reversing table can be faster. Speedup depends on three factors:
Read full article
- table size: larger=faster
- table address: aligned=faster/much faster (15.5 speedup - possible! see chart)
- CPU type
Read full article
niedziela, 11 kwietnia 2010
Determining if an integer is a power of 2
Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:
We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.
; input/ouput: eax ; destroys: ebx test %eax, %eax ; x == 0? jz 1f leal -1(%eax), %ebx ; ebx := x-1 test %eax, %ebx ; ZF := (eax & ebx == 0) setz %al movzx %al, %eax ; eax := ZF 1:
We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.
; input/output: eax ; destroys: ebx, edx bsf %eax, %ebx ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0 jz 1f bsr %eax, %edx ; edx := MSB's position cmp %ebx, %edx ; ZF := (ebx = edx) setz %al movzx %al, %eax ; eax := ZF 1:
czwartek, 8 kwietnia 2010
Brenchless conditional exchange
Suppose we have to exchange (or just move) two registers A and B:
Here is a sample x86 code, where condition is value of CF:
Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.
See also XOR linked list.
- C := A xor B
- C := 0 if condition is not true
- A := A xor C
- B := B xor C
Here is a sample x86 code, where condition is value of CF:
sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise mov ecx, eax xor ecx, ebx ; step 1 and ecx, edx ; completed step 2. - now C is 0 or (A xor B) xor eax, ecx ; step 3 xor ebx, ecx ; step 4
Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.
See also XOR linked list.
sobota, 3 kwietnia 2010
STL: map with string as key - access speedup
The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time - each map store keys sharing certain properties. Drawback: additional memory.
I've tested following grouping schemes:
Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).
Download test program.
I've tested following grouping schemes:
length of string, first letter of string (one level trie), - both length and first letter.
Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).
+-------------+--------------------------------+--------------------------------+ | | running time [ms] | speedup [%] | | data struct +----------+----------+----------+----------+----------+----------+ | | min | avg | max | min | avg | max | +-------------+----------+----------+----------+----------+----------+----------+ | inserting | +-------------+----------+----------+----------+----------+----------+----------+ | std::map | 269 | 287 | 355 | 100 | 100 | 100 | +-------------+----------+----------+----------+----------+----------+----------+ | first char | 218 | 241 | 395 | 81 | 84 | 111 | +-------------+----------+----------+----------+----------+----------+----------+ | length | 218 | 240 | 345 | 81 | 84 | 97 | +-------------+----------+----------+----------+----------+----------+----------+ | len./char | 165 | 172 | 207 | 61 | 60 | 58 | +-------------+----------+----------+----------+----------+----------+----------+ | finding | +-------------+----------+----------+----------+----------+----------+----------+ | std::map | 295 | 322 | 483 | 100 | 100 | 100 | +-------------+----------+----------+----------+----------+----------+----------+ | first char | 243 | 263 | 460 | 82 | 82 | 95 | +-------------+----------+----------+----------+----------+----------+----------+ | length | 238 | 248 | 292 | 80 | 77 | 60 | +-------------+----------+----------+----------+----------+----------+----------+ | len./char | 184 | 190 | 241 | 62 | 60 | 50 | +-------------+----------+----------+----------+----------+----------+----------+
Download test program.
czwartek, 1 kwietnia 2010
Branchless signum
Problem: calculate value of sign(x):
C99 implementation:
- -1 when x < 0
- 0 when x = 0,
- +1 when x > 0.
; input: eax = X movl %eax, %ebx sarl $31, %eax // eax = -1 if X less then zero, 0 otherwise andl $0x7fffffff, %ebx addl $0x7fffffff, %ebx // MSB is set if any lower bits were set shrl $31, $ebx // eax = +1 if X greater then zero, 0 otherwise orl %ebx, %eax // eax = result
C99 implementation:
int32_t sign(int32_t x) { int32_t y; y = (x & 0x7fffffff) + 0x7fffffff; return (x >> 31) | ((uint32_t)y >> 31); }
Fill word with selected bit
This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.
1. The most general algorithm:
1. The most general algorithm:
- mask bit
[10111010] => [00010000]
- clone word
a=[00010000], b=[00010000] - shift bit in first word to MSB, and to LSB in second word
a=[10000000], b=[00000001] - subtract c = a - b
c=[01111111] - add missing MSB c = c OR a
c=[11111111]
- shift bit to MSB
a=[10000000] - arithmetic shift right
a=[11111111]
#include <stdlib.h> #include <stdio.h> #include <stdint.h> uint32_t fill1(uint32_t a, int bit) { uint32_t b; b = a = a & (1 << bit); a <<= 31 - bit; b >>= bit; return (a - b) | a; } uint32_t fill2(uint32_t a, int bit) { a <<= 31 - bit; return (int32_t)(a) >> 31; } uint32_t fill386(uint32_t a, int bit) { uint32_t result; __asm__ __volatile__ ( "bt %1, %0\n" "sbb %0, %0\n" : "=r" (result) : "r" (bit), "0" (a) ); return result; } int main(int argc, char* argv[]) { uint32_t x, i; if (argc > 1) x = (unsigned long)strtol(argv[1], NULL, 0); printf("input = %08x\n", x); for (i=0; i < 32; i++) printf("bit %2d: fill1 = %08x, fill2 = %08x, fill386 = %08x\n", i, fill1(x, i), fill2(x, i), fill386(x, i) ); return EXIT_SUCCESS; }
środa, 31 marca 2010
Transpose bits in byte using SIMD instructions
Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.
Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount - so in this case just one instruction is needed.
Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
- fill vector with given byte
[11010001] =>
[11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001] - leave one bit per byte
[10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001] - perform desired transposition ("move" bits around)
[00000001|000000010|00000000|00001000|00000000|00000000|00000000|10000000] - perform horizontal OR of all bytes
[10001011]
Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
# 1.1Ad 2. Simple pand with mask packed_qword(0x8040201008040201).
movd %eax, %xmm0
shufps $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0
# 1.2
pxor %xmm1, %xmm1
movd %eax, %xmm0
pshufb %xmm1, %xmm0
pand MASK1, %xmm0Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq - we get negated result), then mask bits on desired positons.
SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount - so in this case just one instruction is needed.
# SSEAd 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.
pcmpeq %xmm1, %xmm0
pandn MASK2, %xmm0
# SSE5
protb ROT, %xmm0, %xmm0
psadbw %xmm1, %xmm0Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.
movd %xmm0, %eax
Opis biblioteki pthreads na Wikibooks
W ramach głębszego poznawania biblioteki pthreads utworzyłem w serwisie wikibooks opis POSIX Threads, zawierający bardzo podstawowe informacje o większości funkcji tej biblioteki oraz nieco o rozszerzeniach Linuxowej implementacji. Integralną częścią są przykładowe programy.
(Wikibooks niestety nie zyskał takiej popularności, jak Wikipedia, a to całkiem wygodna platforma do tworzenia różnego rodzaju podręczników).
(Wikibooks niestety nie zyskał takiej popularności, jak Wikipedia, a to całkiem wygodna platforma do tworzenia różnego rodzaju podręczników).
wtorek, 30 marca 2010
PostgreSQL: get selected rows with given order
Suppose that database stores some kind of dictionary and user picks some items, but wants to keep order. For example dictionary has entries with id=0..10, and user picked 9, 2, 4 and 0. This simple query does the job (query splitted):
foo = SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord FROM generate_series(1, 4) AS i; SELECT * FROM dictionary INNER JOIN (foo) ON dictionary.id=foo.index ORDER BY foo.ord