So with all my JPEG decoder analysis, the library that performed the best was the libjpeg SIMD extension. It is several times faster than the normal libjpeg because it uses the modern SSE instructions for the core encoding/decoding routines. Not only does it use the SSE instruction set, but it does so in assembly language – not using compiler intrinsics. This means it is even faster because compiler intrinsics are notorious for producing inefficient or even wrong code. Unfortunately, it is only 32-bit (x86) – and trying to compile a 64-bit version of the library would mean porting over all the assembly language code.
At first glance, porting the assembly code from 32-bit to 64-bit seems intimidating, but after a while you realize that there are many versions of the same thing. Each routine is coded in MMX, 3Dnow, regular SSE, SSE2. And all these options can be applied to the slow integer, fast integer, and floating-point algorithms so you get dozens of assembly language files that you really just don’t need. The fastest combination is SSE2 and fast integer math. We can remove everything else because almost all recent processors in the past 5 years or so support SSE2 (Intel Pentium 4 and Intel Pentium M and AMD Athlon 64 and up). The fast integer math algorithms might cause a small reduction in image quality, but it’s not very noticeable and you get a solid 3% improvement in speed. Disabling everything but SSE2 fast integer algorithms leaves you only with 10 assembly language files to modify, which isn’t too bad.
Now don’t get me wrong, as you can see from my previous post, converting from 32-bit to 64-bit assembly is a giant pain. It took me several days, putting in hours each day to carefully convert and debug each one to make sure it was at least working with my data. I finally got it all seemingly working 64-bit (it doesn’t crash when loading some images off Facebook or from my camera), which is quite a good feeling because to my knowledge nobody else has done that.
I even mexed my port in a 64-bit Matlab mex function named readjpegfast. Unfortunately, Matlab stores its images all weird so a lot of time is wasted just re-arranging the pixel data into a column-first, RGB plane-separated format. For small images roughly 640×480, I get an impressive ~2X improvement on loading 2000 images over imread (Intel Core2 Duo 2.4 GHz T8300 laptop):
>> tic, for i = 0:2000, img = imread(sprintf(‘%0.4d.jpg’, i)); end, toc
Elapsed time is 21.120399 seconds.
>> tic, for i = 0:2000, img = readjpegfast(sprintf(‘%0.4d.jpg’, i)); end, toc
Elapsed time is 9.714863 seconds.
Larger images unfortunately don’t fair so well just because I have to do this format conversion (it would be better if I modified the library to load images into memory using the Matlab format, but that’s way too much work. The 7 megapixel pictures from my camera only saw about a 1.25X improvement:
>> tic, for i = 0:35, img = imread(sprintf(‘big\\%0.4d.jpg’, i)); end, toc
Elapsed time is 13.393228 seconds.
>> tic, for i = 0:35, img = readjpegfast(sprintf(‘big\\%0.4d.jpg’, i)); end, toc
Elapsed time is 10.068488 seconds.
Oh well, since I plan to be mostly using this in C++ where the default data-format of libjpeg is perfect for my uses, this is still a huge win. Soon I hope to be releasing a package that includes my 64-bit port of libjpeg SIMD.