In all my JPEG decoder analysis, the library that performed the best was the libjpeg SIMD extension. It is several times faster than the normal libjpeg because it uses the modern SSE instructions for the core encoding/decoding routines. Not only does it use the SSE instruction set, but it does so in hand-written assembly language rather than compiler intrinsics. This means it is even faster, because compiler intrinsics are notorious for producing inefficient or even wrong code. Unfortunately, it is 32-bit (x86) only, and compiling a 64-bit version of the library would mean porting all of the assembly language code.
At first glance, porting the assembly code from 32-bit to 64-bit seems intimidating, but after a while you realize that there are many versions of the same thing. Each routine is coded in MMX, 3DNow!, regular SSE, and SSE2. All of these options can be applied to the slow integer, fast integer, and floating-point algorithms, so you get dozens of assembly language files that you really just don’t need. The fastest combination is SSE2 and fast integer math. We can remove everything else because almost all processors from the past 5 years or so support SSE2 (Intel Pentium 4, Intel Pentium M, and AMD Athlon 64 and up). The fast integer math algorithms might cause a small reduction in image quality, but it’s not very noticeable and you get a solid 3% improvement in speed. Disabling everything but the SSE2 fast integer algorithms leaves you with only 10 assembly language files to modify, which isn’t too bad.
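If you strip everything but the SSE2 path, it is worth confirming at run time that the CPU really has SSE2 before calling into it. Here is a minimal sketch of that check. It assumes GCC or Clang on x86/x86-64 (it relies on their `__builtin_cpu_supports` builtin), and `pick_dct_path` and its return strings are hypothetical names made up for illustration, not anything from libjpeg SIMD.

```cpp
#include <cassert>

// Returns true when the CPU running this program reports SSE2 support.
// Assumption: GCC or Clang targeting x86 or x86-64, which provide
// __builtin_cpu_init and __builtin_cpu_supports. On any other compiler
// or architecture this sketch conservatively returns false.
bool cpu_has_sse2() {
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    // Harmless here; only strictly needed when running before static
    // constructors have executed.
    __builtin_cpu_init();
    const bool has = __builtin_cpu_supports("sse2") != 0;
#if defined(__x86_64__)
    // SSE2 is part of the x86-64 baseline, so a 64-bit build should
    // always see it.
    assert(has);
#endif
    return has;
#else
    return false;
#endif
}

// Hypothetical dispatch helper: the name and the returned labels are
// illustrative only and do not correspond to real libjpeg SIMD symbols.
const char* pick_dct_path() {
    return cpu_has_sse2() ? "sse2-ifast" : "c-fallback";
}
```

On a 64-bit build the check should always succeed, so in practice it only matters for the 32-bit library running on very old hardware.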
Now don’t get me wrong, as you can see from my previous post, converting from 32-bit to 64-bit assembly is a giant pain. It took me several days, putting in hours each day, to carefully convert and debug each file and make sure it was at least working with my data. I finally got it all seemingly working in 64-bit (it doesn’t crash when loading some images off Facebook or from my camera), which is quite a good feeling because to my knowledge nobody else has done that.
I even wrapped my port in a 64-bit Matlab mex function named readjpegfast. Unfortunately, Matlab stores its images all weird, so a lot of time is wasted just re-arranging the pixel data into a column-first, RGB plane-separated format. For small images of roughly 640×480, I get an impressive ~2X improvement over imread when loading 2000 images (Intel Core2 Duo 2.4 GHz T8300 laptop):
>> tic, for i = 0:2000, img = imread(sprintf('%0.4d.jpg', i)); end, toc
Elapsed time is 21.120399 seconds.
>> tic, for i = 0:2000, img = readjpegfast(sprintf('%0.4d.jpg', i)); end, toc
Elapsed time is 9.714863 seconds.
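To give an idea of what that re-arranging step involves, here is a minimal sketch in C++. It assumes the decoder hands back 8-bit interleaved RGB in row-major order (the usual libjpeg scanline layout) and that the target is a height × width × 3 column-major array, which is how Matlab lays out an RGB image. The function name is my own invention for this illustration and this is not the actual code from readjpegfast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert interleaved, row-major RGB (R,G,B,R,G,B,... one scanline after
// another) into plane-separated, column-major order: the whole R plane,
// then G, then B, with each plane stored column by column.
// The function name is hypothetical and used only for this sketch.
std::vector<std::uint8_t> interleaved_to_planar_colmajor(
    const std::vector<std::uint8_t>& src,
    std::size_t width,
    std::size_t height)
{
    const std::size_t channels = 3;
    assert(src.size() == width * height * channels);

    std::vector<std::uint8_t> dst(src.size());
    const std::size_t plane = width * height;  // elements per colour plane

    for (std::size_t row = 0; row < height; ++row) {
        // Start of this scanline in the interleaved source.
        const std::uint8_t* in = src.data() + row * width * channels;
        for (std::size_t col = 0; col < width; ++col) {
            // Column-major offset of pixel (row, col) inside one plane.
            const std::size_t out = col * height + row;
            dst[out]             = in[col * channels + 0];  // R
            dst[out + plane]     = in[col * channels + 1];  // G
            dst[out + 2 * plane] = in[col * channels + 2];  // B
        }
    }
    return dst;
}
```

For a 3×2 image whose bytes are 1 through 18, the R plane comes out as 1, 10, 4, 13, 7, 16. Note that every pixel produces three writes that land far apart in memory, which is probably a big part of why the conversion costs so much on large images.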
Larger images unfortunately don’t fare so well, just because I have to do this format conversion (it would be better if I modified the library to load images into memory using the Matlab format, but that’s way too much work). The 7 megapixel pictures from my camera only saw about a 1.3X improvement:
>> tic, for i = 0:35, img = imread(sprintf('big\\%0.4d.jpg', i)); end, toc
Elapsed time is 13.393228 seconds.
>> tic, for i = 0:35, img = readjpegfast(sprintf('big\\%0.4d.jpg', i)); end, toc
Elapsed time is 10.068488 seconds.
Oh well, since I plan to mostly use this in C++, where the default data format of libjpeg is perfect for my uses, this is still a huge win. Soon I hope to release a package that includes my 64-bit port of libjpeg SIMD.
“We can remove everything else because almost all recent processors in the past 5 years or so support SSE (Intel Pentium 4 and Intel Pentium M and AMD Athlon 64 and up).”
Yep, SSE2 is effectively a base requirement for x86-64 support.
Check out the trunk of the TigerVNC subversion repository. We are using libjpeg/SIMD and have working, optimized 64-bit and 32-bit versions in there. Significant optimization was done to the Huffman codec as well (which is all C code.)
Hey DRC, thanks for the heads up. Too bad I didn’t dig deeper when trying to find somebody who had already done the work. I did see the message posts on the tigervnc-devel mailing list discussing the port but I didn’t get the impression it had been completed. You guys should really throw up a link on the TigerVNC website because that is a big deal for lots of high-performance computing people that deal exclusively with 64-bit systems.
Hi, Brian. In response to several requests to use our version of libjpeg/SIMD independently of TigerVNC, I did some cleanup in the build system to allow it to build as a stand-alone library and threw up the following article to describe how to build and use it:
http://www.virtualgl.org/DeveloperInfo/Libjpeg
Please let me know if you run into any problems.
Hi DRC.
I just tried TigerVNC on my 32-bit Gentoo system and was impressed. This is a major step forward over older VNC code; I guess the JPEG enhancements are a significant part of it.
I’ve been trying to build TigerVNC on x64 Ubuntu, since that’s the target system, but as the README points out, it’s a bit of jive.
I’ve found some i386 packages have been put up but nothing for x64.
Since you seem well equipped and are testing on that platform, is there a possibility of posting some binaries, or at least build instructions, for x64?
I guess if I spend a few days messing about I could work out how to do it, but it does seem like a heavy case of reinventing the wheel.
Any instructions from an accomplished wheel-wright would be a great help!
😉