[Zlib-devel] [PATCH 07/12] Adds SSE2 optimized hash shifting to fill_window.

Arjan van de Ven arjanvandeven at gmail.com
Tue Dec 31 17:57:48 EST 2013


"Assume that the L1 latency is 4 cycles and the L2 latency is 11 cycles,
that the hardware does not autonomously prefetch,"

Since this is explicitly x86 (SSE) code, the CPUs that can run it will ALL
prefetch. A linear access pattern like this one is the easiest case for
any hardware prefetcher.

In fact, pretty much all of them will run the code through an out-of-order
engine, which has the effect of unrolling the loop -- but in the hardware --
and issuing all the cache loads etc. well ahead of time.
The code here keeps the control logic (including the array index
computation) completely separate from the payload, so the out-of-order/
speculation engine handles it pretty much perfectly.
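For reference, the loop under discussion has roughly this shape (a minimal
sketch, not the patch itself: fill_window slides the hash chains down by
wsize, and the SSE2 saturating subtract clamps would-be-negative entries to
zero; the function and parameter names here are illustrative):

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Slide n hash entries down by wsize; entries below wsize saturate to 0.
 * The index bookkeeping (control) is fully independent of the subtract
 * (payload), which is why the out-of-order engine overlaps them so well. */
static void slide_hash_sse2(uint16_t *head, unsigned n, uint16_t wsize)
{
    const __m128i W = _mm_set1_epi16((short)wsize);
    unsigned i;
    for (i = 0; i + 8 <= n; i += 8) {
        __m128i v = _mm_loadu_si128((__m128i *)(head + i));
        v = _mm_subs_epu16(v, W);          /* max(v - wsize, 0), per lane */
        _mm_storeu_si128((__m128i *)(head + i), v);
    }
    for (; i < n; i++)                     /* scalar tail */
        head[i] = head[i] > wsize ? (uint16_t)(head[i] - wsize) : 0;
}
```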

Manual unrolling actually creates more register pressure, which often ends
up hindering, not helping, the out-of-order engines in CPUs.


Control logic is not free, but it runs concurrently with the SSE
instructions on modern x86 CPUs (say, less than a decade old), since the
two occupy different execution ports.


My suspicion is that, at best, the hand-unrolled version will be exactly
the same speed, but with quite a bit less readable code (unrolled loops
tend to end up that way, especially since you then need to handle the
ends separately).



On Tue, Dec 31, 2013 at 2:02 AM, John Reiser <jreiser at bitwagon.com> wrote:
> On 12/30/2013 03:51 PM, Arjan van de Ven wrote:
>> Loops are cheap... Very cheap. At least on more modern CPUs. It often helps to unroll less, not more.
>> (See one of the other patches);
>
> Control cycles are cheap, but they are not free.  The question is whether
> the control simultaneously can minimize effective cache latency and maximize
> the overlap with computation.
>
> Assume that the L1 latency is 4 cycles and the L2 latency is 11 cycles,
> that the hardware does not autonomously prefetch,
> and that the array is larger than the L1 data cache but smaller than the L2.
> In the steady state then I believe that the presented code requires at least
> (11 + 4*(4 + 2)) cycles (one L2 latency plus four repetitions of an L1 latency
> followed by 2 cycles of compute + store) to process 32 elements.
> Overlapping and pipelining to hide the cache latencies ought to achieve
> the minimum, which is (1+ (4 + 4)) cycles per 32 elements (an explicit
> prefetch from L2, 4 fetches from L1, and 4 stores to L1, with the
> subtractions pipelined and overlapped with the cache activity.)
> The simplest way seems to involve unrolling 4X.
>
>
> _______________________________________________
> Zlib-devel mailing list
> Zlib-devel at madler.net
> http://mail.madler.net/mailman/listinfo/zlib-devel_madler.net
