Milo Yip has compared different itoa and dtoa implementations on a Core i7, including my itoa algorithm 2, which uses SSE2 instructions.
The results for itoa are interesting: the SSE2 version is not as good as it seemed to be. The tricky branchlut algorithm is only 10% slower and, moreover, is perfectly portable. One obvious drawback of this method is its use of a lookup table: in a real environment, where there is heavy pressure on the cache, memory access could become a bottleneck.
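To make the lookup-table trade-off concrete, here is a minimal sketch of a table-driven conversion in C. It is not the branchlut code from the benchmark, which differs in its details; the names `kDigitPairs` and `u32toa_lut` are made up for this illustration. The idea it shows is that a 200-byte table of digit pairs lets each division by 100 produce two output characters, at the cost of one table read per pair.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 200-byte table: the two ASCII digits of every value 00..99. */
static const char kDigitPairs[201] =
    "0001020304050607080910111213141516171819"
    "2021222324252627282930313233343536373839"
    "4041424344454647484950515253545556575859"
    "6061626364656667686970717273747576777879"
    "8081828384858687888990919293949596979899";

/* Writes the decimal form of `value` into `out` (at least 11 bytes),
   NUL-terminates it, and returns the number of digits written. */
static size_t u32toa_lut(uint32_t value, char *out)
{
    char tmp[10];               /* a 32-bit value has at most 10 digits */
    char *p = tmp + sizeof tmp; /* fill the buffer from the end */

    /* Peel off two digits per iteration with a single table read. */
    while (value >= 100) {
        uint32_t pair = value % 100;
        value /= 100;
        p -= 2;
        memcpy(p, &kDigitPairs[pair * 2], 2);
    }

    /* One or two leading digits remain. */
    if (value >= 10) {
        p -= 2;
        memcpy(p, &kDigitPairs[value * 2], 2);
    } else {
        *--p = (char)('0' + value);
    }

    size_t len = (size_t)(tmp + sizeof tmp - p);
    memcpy(out, p, len);
    out[len] = '\0';
    return len;
}
```

The table read is where the cache concern comes from: the 200 bytes span several cache lines, and if other work has evicted them, each lookup may have to wait on memory.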