[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Shuxin Yang shuxin.llvm at gmail.com
Fri Jan 4 19:31:06 PST 2013


Hi, Hal:

   Sorry for the disturbance; a bunch of newbie questions:

  1) How do you know how many iterations ahead is adequate? Which VTTI
interface tells you the latency of promoting data from L2 => L1? If the
loop with the SW prefetch instructions inserted is later unrolled by
some amount, do you adjust the prefetching code accordingly?

  2) Is "L1" the topmost level in the memory hierarchy (i.e., no L0)?
If this is the case (and if my memory serves me right), I think this
change won't help FP programs on Itanium, as FP data bypasses L1 and is
loaded from L2.

  3) I saw you have almost "hardcoded" some u-arch parameters (about
the memory hierarchy) at the beginning of the patch. Is that just for
the time being? I guess the Polly folks may have some magic interface
for that purpose. Actually, I'm looking for such an interface myself;
if you already know of one, please let me know. Profuse thanks in
advance!

  4) Is this work based on any existing published work?

Have a nice weekend!
Shuxin


On 01/02/2013 03:16 PM, Hal Finkel wrote:
> Hi again,
>
> I've started working on a transformation pass to add explicit prefetch instructions to prefetch loop data into the L1 cache. The usefulness of this kind of transformation is obviously target dependent (and, moreover, often dependent on how the hardware prefetcher(s) operate). Nevertheless, there are some motivational factors common to many platforms:
>
> 1. On many platforms, to prevent L1 cache pollution, the hardware prefetcher does not prefetch into the L1 cache (or, if it does, it is not aggressive enough to achieve maximal performance). This is true on some of my PowerPC hardware, and also true on some modern x86 hardware (such as Nehalem and Barcelona [2]). The hardware prefetchers will prefetch only into L2 (or L3, etc.) but not into L1, and so explicit software prefetching is the only way to pre-load data into the L1 cache. While it is often true that out-of-order cores can hide the latency of L1 misses, explicit prefetching can still sometimes help.
>
> 2. Software prefetching is useful for short streams (most prefetchers require at least 2 misses to establish a prefetching stream).
>
> 3. Software prefetching is useful for irregular (but predictable) data access patterns (see the sketch after this list).
>
> 4. Software prefetching is useful when prefetching all of the necessary data for a loop would exceed the number of streams that the hardware can handle. The number of streams is often fairly limited (~8-32), and the streams are often shared between cores (and hardware threads) for upper-level caches. In cases where a large number of streams would be needed, software prefetching may be the only way to prefetch data for the loop.
>
> 5. Hardware prefetching often cannot cross page boundaries [1], and so software prefetching is necessary to prevent misses on page boundaries (and pages can be ~4 kB on many systems).
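>
> As an aside on point 3, the canonical case is an indexed gather: the hardware prefetcher sees no stride in a[idx[i]], but software can compute the address several iterations early because idx[] itself is walked sequentially. This is only an illustration of the access pattern, not code from the patch; the names and the distance parameter D are made up:
>
>   #include <cstddef>
>
>   // a[idx[i]] looks random to a stride-based hardware prefetcher,
>   // but the address for iteration i+D is computable now, because
>   // idx[] is accessed with a regular stride.
>   double gather(const double *a, const int *idx, std::size_t n,
>                 unsigned D) {
>     double sum = 0.0;
>     for (std::size_t i = 0; i < n; ++i) {
>       if (i + D < n)
>         __builtin_prefetch(&a[idx[i + D]]);
>       sum += a[idx[i]];
>     }
>     return sum;
>   }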
>
> The initial version of this pass is fairly simple. It uses CodeMetrics to estimate the number of cycles necessary to execute the loop body, and divides a heuristic prefetch latency by that estimate to calculate how many loop iterations ahead to prefetch data. It then inserts a prefetch instruction after every load (skipping loads within one cache-line size of an already-prefetched load, to avoid prefetching the same cache line twice). This is fairly effective on my PowerPC hardware and, somewhat to my surprise, is sometimes beneficial on my x86 test machine. To be clear, using this pass often produces slowdowns on my Xeon testing system (more often than speedups), so it would certainly need some work to be generally applicable. If anyone is interested in working on this with me, please let me know.
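>
> To make the distance heuristic concrete, here is roughly the effect at the source level. This is an illustration of the idea, not code from the patch; the names, the bounds guard, and treating the latency as a single constant are all made up for the example:
>
>   #include <algorithm>
>   #include <cstddef>
>
>   // Distance heuristic: prefetch far enough ahead that the line
>   // arrives by the time the loop catches up. Both inputs are
>   // estimates (CodeMetrics for the body cost, a guessed constant
>   // for the prefetch latency).
>   unsigned prefetchDistance(unsigned LatencyCycles,
>                             unsigned BodyCycles) {
>     return std::max(1u, LatencyCycles / std::max(1u, BodyCycles));
>   }
>
>   // With distance D, each load of a[i] gets a companion prefetch
>   // of a[i + D]. A prefetch is only a hint, so a useless one near
>   // the end of the array costs a little bandwidth, nothing more.
>   double sumArray(const double *a, std::size_t n, unsigned D) {
>     double sum = 0.0;
>     for (std::size_t i = 0; i < n; ++i) {
>       if (i + D < n)
>         __builtin_prefetch(&a[i + D]);
>       sum += a[i];
>     }
>     return sum;
>   }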
>
> Some future work (in no particular order):
>
> 1. Use VTTI instead of (or in addition to) CodeMetrics in order to get a more-accurate estimate of the loop iteration cost.
>
> 2. Use loop dependence analysis to inhibit prefetching of loads we've just recently accessed in previous iterations (and maybe nearby data?)
>
> 3. Additional heuristics to limit prefetch insertion when we have smarter hardware (like on x86) that needs help only with more-difficult cases
>
> 4. node->next prefetching for linked-list iteration
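>
> For item 4, the idea is the familiar one: overlap the pointer-chase latency of the next node with the work on the current one. A sketch (Node and the loop are stand-ins, not code from the pass):
>
>   struct Node { Node *next; long payload; };
>
>   // Prefetching n->next hides one hop of load latency per node;
>   // the chain can't be prefetched further ahead without extra
>   // pointer dereferences.
>   long walk(const Node *head) {
>     long sum = 0;
>     for (const Node *n = head; n; n = n->next) {
>       if (n->next)
>         __builtin_prefetch(n->next);
>       sum += n->payload;
>     }
>     return sum;
>   }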
>
> In short, I'm sending this e-mail as (hopefully) a conversation-starter. As is, the pass is quite useful for me, and I'd like to know what kinds of things need to happen to make it useful more generally. I have only a very basic idea of what this means for smarter hardware and out-of-order cores, so feedback is certainly welcome.
>
> That having been said, I'd like to commit this to trunk (turned off by default). As a side note, gcc has -fprefetch-loop-arrays, and we could similarly add a Clang flag to enable this pass.
>
> Thanks again,
> Hal
>
> Some good references are:
> [1] Memory part 5: What programmers can do.
> Ulrich Drepper, 2007.
> http://lwn.net/Articles/255364/
>
> [2] When Prefetching Works, When It Doesn't, and Why
> Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
> http://vuduc.org/pubs/lee2012-taco.pdf
>
> P.S. It occurs to me that this probably won't apply against today's trunk because of the header renaming, but I can post a rebased patch soon.
>
>
