[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Hal Finkel hfinkel at anl.gov
Sat Jan 5 17:29:30 PST 2013


----- Original Message -----
> From: "Shuxin Yang" <shuxin.llvm at gmail.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Friday, January 4, 2013 9:31:06 PM
> Subject: Re: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> 
> 
> Hi, Hal:
> 
> Sorry for disturbing you; a bunch of dumb newbie questions:

Not a problem; feedback is what I wanted :)

> 
> 1) How do you know how many iterations ahead is adequate? Which VTTI
> interface tells you the latency of promoting
> data from L2 => L1?

None of them yet. We'll need to add these. Currently, there is just a command-line parameter which defaults to 300 cycles (this seems to be a reasonable default for data in L2 but not in L1 on both my PowerPC hardware and my x86 test machine). Eventually, we'll need several callbacks.
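
To make that arithmetic concrete, here is a minimal sketch of the
distance computation (the 300-cycle figure is just the current
command-line default; the function and parameter names are
illustrative, not the patch's actual code):

    #include <algorithm>

    // Issue each prefetch far enough ahead that the data has time to
    // arrive (e.g., L2 -> L1) before the corresponding load executes.
    unsigned getPrefetchDistance(unsigned PrefetchLatency,   // e.g. 300 cycles
                                 unsigned LoopBodyCycles) {  // per-iteration cost
      // Round up, and always look at least one iteration ahead.
      return std::max(1u, (PrefetchLatency + LoopBodyCycles - 1) /
                          LoopBodyCycles);
    }

For example, a loop body estimated at 40 cycles with the 300-cycle
default gives ceil(300/40) = 8 iterations of lookahead.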

> If a loop with SW prefetch instructions
> inserted is later unrolled by some amount,
> do you adjust the prefetching code accordingly?

No, I'm currently adding the prefetching only after all of the unrolling is done. Adding prefetches before unrolling would either result in adding too many, or require introducing additional control flow around the prefetches (which is probably not a win if the loop is not later unrolled sufficiently). What I mean is that adding { if (i % 8 == 0) prefetch(a[i+m]); } is probably only a good idea if the loop is later unrolled by some multiple of 8 (although I could be wrong for some out-of-order cores).
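
To illustrate with a hypothetical source-level example (assuming
64-byte cache lines holding 8 doubles; none of this is code from the
patch): after unrolling by 8, the guard folds away and exactly one
prefetch per cache line remains:

    // Unrolled-by-8 loop: one prefetch per 64-byte (8-double) cache
    // line. PrefetchDist is the lookahead in elements.
    double sumWithPrefetch(const double *A, int N, int PrefetchDist) {
      double Sum = 0.0;
      int I = 0;
      for (; I + 7 < N; I += 8) {
        // A prefetch past the end of A is harmless: it is only a hint.
        __builtin_prefetch(&A[I + PrefetchDist]);
        Sum += A[I]   + A[I+1] + A[I+2] + A[I+3]
             + A[I+4] + A[I+5] + A[I+6] + A[I+7];
      }
      for (; I < N; ++I) // remainder iterations
        Sum += A[I];
      return Sum;
    }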

> 
> 2) Is "L1" the top most level in the memory hierarchy (i.e. no L0)?
> If this is the case (and if my memory serve me right),
> I think this change won't help FP programs on Itanium, as FP datum is
> loaded from L1.

Yes. On the other hand, I know very little about Itanium. This pass does help a lot on my in-order PowerPC hardware.

> 
> 3) I saw that you almost "hardcode" some u-arch parameters (about the
> memory hierarchy) at the beginning of
> the patch. Is that just for the time being?

Yes. These things should be part of TTI.

> I guess the Polly folks may have some
> magic interface for that end.

I'm not sure they do, but there may be common interest.

> Actually, I'm looking for such an interface. If you already know of
> one, please let me know. Profuse thanks
> in advance!
> 
> 4) Is this work based on existing published work?

The initial implementation essentially follows the approach suggested in:
David Callahan, Ken Kennedy, and Allan Porterfield. "Software Prefetching". ASPLOS IV, ACM, 1991.
except that I try to avoid prefetching the same cache line more than once (this is actually a strict requirement on my PowerPC hardware to prevent stalls).
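
For reference, a minimal sketch of that once-per-line filtering,
assuming each candidate load can be reduced to a base pointer plus a
constant byte offset (all names here are illustrative; the patch's
actual bookkeeping may differ):

    #include <cstdint>
    #include <set>
    #include <utility>

    static const uint64_t CacheLineSize = 64; // should come from TTI

    // Returns true only the first time a given (base, cache line) pair
    // is seen, so each cache line is prefetched at most once.
    bool shouldPrefetch(const void *Base, uint64_t ByteOffset,
                        std::set<std::pair<const void *, uint64_t>> &Seen) {
      return Seen.insert({Base, ByteOffset / CacheLineSize}).second;
    }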

Later research discusses using dependence analysis to eliminate unneeded prefetches, and I'd like to incorporate those techniques as well. For systems (like x86) that have more-intelligent hardware prefetchers, we'll also want some additional heuristics to restrict prefetch insertion to only those cases where it is actually helpful.

Thanks again,
Hal

> 
> Have a nice weekend!
> Shuxin
> 
> 
> On 01/02/2013 03:16 PM, Hal Finkel wrote:
> 
> 
> Hi again,
> 
> I've started working on a transformation pass to add explicit
> prefetch instructions to prefetch loop data into the L1 cache. The
> usefulness of this kind of transformation is obviously target
> dependent (and, moreover, often dependent on how the hardware
> prefetcher(s) operate). Nevertheless, there are some motivational
> factors common to many platforms:
> 
> 1. On many platforms, to prevent L1 cache pollution, the hardware
> prefetcher does not prefetch into the L1 cache (or, if it does, it
> is not aggressive enough to achieve maximal performance). This is
> true on some of my PowerPC hardware, and also true on some modern
> x86 hardware (such as Nehalem and Barcelona [2]). The hardware
> prefetchers will prefetch only into L2 (or L3, etc.) but not into
> L1, and so explicit software prefetching is the only way to pre-load
> data into the L1 cache. While it is often true that out-of-order
> cores can hide the latency of L1 misses, explicit prefetching can
> still sometimes help.
> 
> 2. Software prefetching is useful for short streams (most prefetchers
> require at least 2 misses to establish a prefetching stream).
> 
> 3. Software prefetching is useful for irregular (but predictable)
> data access patterns (see the sketch after this list).
> 
> 4. Software prefetching is useful when prefetching all of the
> necessary data for a loop would exceed the number of streams that
> the hardware can handle. The number of streams is often fairly
> limited (~8-32), and the streams are often shared between cores (and
> hardware threads) for upper-level caches. In cases where a large
> number of streams would be needed, software prefetching may be the
> only way to prefetch data for the loop.
> 
> 5. Hardware prefetching often cannot cross page boundaries [1], and
> so software prefetching is necessary to prevent misses on page
> boundaries (and pages can be ~4 kB on many systems).
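> 
> To illustrate case 3 with a hypothetical example (not code from this
> pass): a hardware stream prefetcher sees no pattern in an indexed
> gather, but software can look ahead through the index array, which
> is itself read sequentially:
> 
>     double gatherSum(const double *A, const int *Idx, int N, int Dist) {
>       double Sum = 0.0;
>       for (int I = 0; I < N; ++I) {
>         if (I + Dist < N) // irregular but predictable address
>           __builtin_prefetch(&A[Idx[I + Dist]]);
>         Sum += A[Idx[I]];
>       }
>       return Sum;
>     }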
> 
> The initial version of this pass is fairly simple. It uses
> CodeMetrics to estimate the number of cycles necessary to execute
> the loop body, and divides that by a heuristic prefetch latency to
> calculate how many loop iterations ahead to prefetch data. It
> then inserts prefetch instructions after every load (but not for
> loads within one cache-line size of an already-prefetched load, to
> avoid double-prefetching cache lines). This is fairly effective on
> my PowerPC hardware and (somewhat to my surprise) is sometimes
> beneficial on my x86 test machine. To be clear, using this pass
> often produces slowdowns on my Xeon testing system (more often than
> speedups), so it would certainly need some work to be generally
> applicable. If anyone is interested in working on this with me,
> please let me know.
> 
> Some future work (in no particular order):
> 
> 1. Use VTTI instead of (or in addition to) CodeMetrics in order to
> get a more-accurate estimate of the loop iteration cost.
> 
> 2. Use loop dependence analysis to inhibit prefetching of loads we've
> just recently accessed in previous iterations (and maybe nearby
> data?)
> 
> 3. Additional heuristics to limit prefetch insertion when we have
> smarter hardware (like on x86) that needs help only with
> more-difficult cases
> 
> 4. node->next prefetching for linked-list iteration
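> 
> (For concreteness, a hypothetical sketch of 4, not code from this
> pass: prefetch the next node's cache line while the current node is
> being processed. This buys only one node of lookahead, so it mostly
> helps when there is enough work per node.)
> 
>     struct Node { long Val; Node *Next; };
> 
>     long listSum(const Node *L) {
>       long Sum = 0;
>       for (; L; L = L->Next) {
>         __builtin_prefetch(L->Next); // harmless even when Next is null
>         Sum += L->Val;
>       }
>       return Sum;
>     }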
> 
> In short, I'm sending this e-mail as (hopefully) a
> conversation-starter. As is, the pass is quite useful for me, and
> I'd like to know what kinds of things need to happen to make it
> useful more generally. I have only a very basic idea of what this
> means for smarter hardware and out-of-order cores, so feedback is
> certainly welcome.
> 
> That having been said, I'd like to commit this to trunk (turned off
> by default). As a side note, gcc has -fprefetch-loop-arrays, and we
> could similarly add a Clang flag to enable this pass.
> 
> Thanks again,
> Hal
> 
> Some good references are:
> 
> [1] Memory, part 5: What programmers can do. Ulrich Drepper, 2007.
>     http://lwn.net/Articles/255364/
> [2] When Prefetching Works, When It Doesn’t, and Why.
>     Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
>     http://vuduc.org/pubs/lee2012-taco.pdf
> 
> P.S. It occurs to me that this probably won't apply against today's
> trunk because of the header renaming, but I can post a rebased patch
> soon.
> 
> _______________________________________________
> llvm-commits mailing list llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory



