[Zlib-devel] DEFLATE performance improvements v1

Fri Dec 13 18:13:09 EST 2013

Hi Folks,

This patch series introduces a number of deflate performance improvements:
two new deflate strategies, quick and medium, a faster hash function,
PCLMULQDQ-optimized CRC folding, and SSE2 hash shifting.

Changelog:
    - General:
        - fixed CPUID check for 32-bit PIC
        - removed trailing whitespace from various files
        - likely/unlikely attributes are now in zutil.h, and wrapped by a
          check for GCC. They are now exposed as zlikely and zunlikely.
        - explicit check for x86_64 architecture with -m32 CFLAGS to switch to
          i*86 and for i*86 architectures with -m64 CFLAGS
        - switch from uname -p to uname -m
    - Deflate Quick Strategy:
        - deflate_quick.c is now built separately from deflate.c, and is built
          with -msse4
        - changed the constraint in quick_insert_string() from "p" to "r", as
          some versions of clang don't support "p"
        - added a separate compare258 implementation for 32-bit PIC
    - Deflate Medium Strategy:
        - cleaned up some formatting
    - CRC folding:
        - intrinsics variables are no longer exposed in deflate.h. Instead,
          the CRC registers are manually saved and restored in the
          appropriate crc_folding functions.
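
For readers who have not seen branch-hint macros before, the zlikely and
zunlikely change in the General section can be pictured roughly as follows.
This is only a sketch under assumptions: the actual definitions are in the
patched zutil.h and may differ, and count_zero_bytes() is a made-up helper
that exists purely to show the macros in use. Note that clang also defines
__GNUC__, so it takes the same branch as GCC here.

```c
#include <stddef.h>

/* Assumed shape of the macros; the real ones live in the patched zutil.h. */
#if defined(__GNUC__)
#  define zlikely(x)   __builtin_expect(!!(x), 1)
#  define zunlikely(x) __builtin_expect(!!(x), 0)
#else
/* Other compilers: the hints compile away to the plain condition. */
#  define zlikely(x)   (x)
#  define zunlikely(x) (x)
#endif

/* Hypothetical helper (not part of zlib): count the zero bytes in a buffer.
 * Both checks are marked as rarely true, so the compiler can lay out the
 * common path (non-NULL buffer, non-zero byte) as the fall-through case. */
static size_t count_zero_bytes(const unsigned char *buf, size_t len)
{
    size_t i, n = 0;

    if (zunlikely(buf == NULL))
        return 0;

    for (i = 0; i < len; i++) {
        if (zunlikely(buf[i] == 0))
            n++;
    }
    return n;
}
```

The hints only affect code layout and branch prediction heuristics; the
result of the condition is unchanged, which is why the non-GCC fallback can
simply expand to the bare expression.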

While rerunning the performance tests on this revision, we noticed high
variance in some of the results, especially at level 1. We made a few changes
to the benchmark to reduce this noise and reran the tests. The revised
results are:

Compression Corpora: Calgary (normal and large), Canterbury (normal and
                     large), and Silesia
Processor:  Intel i5-2540M @ 2.60 GHz
Compared with git commit 50893291621658f355bc5b4d450a8d06a563053d

Level 9, on average, takes 23% less time (1.31x faster) with no change in
compression.

Level 6, on average, takes 43% less time (1.76x faster) with negligible
change in compression.

Level 1, on average, takes 60% less time (2.43x faster) with a sacrifice of
about 30% compression.

The exact performance of a particular workload is very data dependent. For
example, at level 1 some files, such as pic and ptt5, take 69% less time
(3.19x faster).
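
The percentage and the "x faster" figure in each result above appear to be
two views of the same measurement. Assuming the percentage is the reduction
in elapsed time (an interpretation, not something stated in the benchmark
output), the two are related as in this small sketch. The function names are
made up for illustration.

```c
/* Convert a fractional reduction in run time (0.23 for "23%") into a
 * speedup factor: new_time = old_time * (1 - reduction), so
 * speedup = old_time / new_time = 1 / (1 - reduction).
 * Only meaningful for 0 <= reduction < 1. */
static double speedup_from_time_reduction(double reduction)
{
    return 1.0 / (1.0 - reduction);
}

/* Inverse direction: a speedup factor back to a fractional time reduction.
 * Only meaningful for speedup > 0. */
static double time_reduction_from_speedup(double speedup)
{
    return 1.0 - 1.0 / speedup;
}
```

Under that assumption, 23% maps to about 1.30x, 43% to about 1.75x, 60% to
2.50x, and 69% to about 3.23x. These are close to the reported 1.31x, 1.76x,
2.43x, and 3.19x; the small gaps would be consistent with the percentages
having been rounded to whole numbers.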

I've posted the full git tree (5089329 + these patches) to GitHub, and would
appreciate it if others ran their own benchmarks and posted their results.

https://github.com/jtkukunas/zlib.git

Thanks.

--
Jim Kukunas
Intel Open Source Technology Center