[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Evan Cheng evan.cheng at apple.com
Mon Jan 7 12:37:26 PST 2013


On Jan 5, 2013, at 5:01 PM, Hal Finkel <hfinkel at anl.gov> wrote:

> ----- Original Message -----
>> From: "Evan Cheng" <evan.cheng at apple.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
>> Sent: Friday, January 4, 2013 6:49:16 PM
>> Subject: Re: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
>> 
>> Thanks for working on this. SW prefetching can introduce some
>> overhead that might be difficult to estimate in an LLVM IR pass.
> 
> Agreed. For one thing, the extra addressing calculations can increase register pressure in some cases. It is also hard to estimate the number of cycles it will take to execute the loop body (although our new TTI/CostModel infrastructure should help somewhat).
> 
>> Have
>> you considered implementing an MI pass so it can utilize analyses
>> such as MachineTraceMetrics?
> 
> I did not consider it because I did not know it existed ;) The critical path calculation looks like it would be very useful for computing the prefetch distance. We could construct an MI-level pass by creating a TII callback for adding prefetches and making use of the MMOs with SE (and maybe the new dependence analysis) to get higher-level loop information.
> 
> Nevertheless, I'm not sure whether the extra accuracy gained from operating at the MI level will really help, because the prefetch distances are generally only rough estimates (being off by a few hundred cycles might be okay), and the real work seems to be in determining whether prefetching at all will help or hurt. It might be easier to experiment with the basic heuristics at the IR level and then move to the MI level if and when necessary. Also, at the IR level the extra addressing calculations benefit from LSR, get DAGCombined, etc. (I'm not yet sure how important those things are either).
> 
> What do you suggest?

I agree that what you have is fine for experimental purposes, but I suspect you will need to move to an MI pass if you want something that works well across the board. I'm OK with the pass going in, just not enabled by default.

Thanks,

Evan

> 
> Thanks again,
> Hal
> 
>> 
>> Evan
>> 
>> On Jan 2, 2013, at 3:16 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> 
>>> Hi again,
>>> 
>>> I've started working on a transformation pass that inserts
>>> explicit prefetch instructions to pull loop data into the L1
>>> cache. The usefulness of this kind of transformation is obviously
>>> target-dependent (and, moreover, often depends on how the hardware
>>> prefetcher(s) operate). Nevertheless, there are some motivating
>>> factors common to many platforms:
>>> 
>>> 1. On many platforms, to prevent L1 cache pollution, the hardware
>>> prefetcher does not prefetch into the L1 cache (or, if it does, it
>>> is not aggressive enough to achieve maximal performance). This is
>>> true on some of my PowerPC hardware, and also true on some modern
>>> x86 hardware (such as Nehalem and Barcelona [2]). The hardware
>>> prefetchers will prefetch only into L2 (or L3, etc.) but not into
>>> L1, and so explicit software prefetching is the only way to
>>> pre-load data into the L1 cache. While it is often true that
>>> out-of-order cores can hide the latency of L1 misses, explicit
>>> prefetching can still sometimes help.
>>> 
>>> 2. Software prefetching is useful for short streams (most
>>> prefetchers require at least 2 misses to establish a prefetching
>>> stream).
>>> 
>>> 3. Software prefetching is useful for irregular (but predictable)
>>> data access patterns.
>>> 
>>> 4. Software prefetching is useful when prefetching all of the
>>> necessary data for a loop would exceed the number of streams that
>>> the hardware can handle. The number of streams is often fairly
>>> limited (~8-32), and the streams are often shared between cores
>>> (and hardware threads) for upper-level caches. In cases where a
>>> large number of streams would be needed, software prefetching may
>>> be the only way to prefetch data for the loop.
>>> 
>>> 5. Hardware prefetching often cannot cross page boundaries [1], so
>>> software prefetching is necessary to prevent misses at page
>>> boundaries (and pages are often only ~4 kB).
>>> 
>>> The initial version of this pass is fairly simple. It uses
>>> CodeMetrics to estimate the number of cycles necessary to execute
>>> the loop body, and divides a heuristic prefetch latency by that
>>> estimate to calculate how many loop iterations ahead to prefetch
>>> data. It then inserts a prefetch instruction after every load
>>> (skipping loads within one cache-line size of an already-prefetched
>>> load, to avoid double-prefetching cache lines). This is fairly
>>> effective on my PowerPC hardware and, somewhat to my surprise,
>>> sometimes beneficial on my x86 test machine. To be clear, this
>>> pass more often produces slowdowns than speedups on my Xeon test
>>> system, so it would certainly need some work to be generally
>>> applicable. If anyone is interested in working on this with me,
>>> please let me know.
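>>> 
>>> To make the heuristic concrete, here is a minimal C sketch of the
>>> transformation; the latency and cost numbers are illustrative
>>> assumptions, not measured values:
>>> 
>>>   /* If the loop body is estimated at ~40 cycles and we assume a
>>>      ~300-cycle prefetch latency, the pass prefetches roughly
>>>      ceil(300/40) = 8 iterations ahead. __builtin_prefetch lowers
>>>      to the target's prefetch instruction (e.g., dcbt on PowerPC);
>>>      prefetches past the end of the array are harmless, since
>>>      prefetching an invalid address is a no-op. */
>>>   double sum(const double *a, long n) {
>>>     double s = 0.0;
>>>     for (long i = 0; i < n; ++i) {
>>>       __builtin_prefetch(&a[i + 8], /*rw=*/0, /*locality=*/3);
>>>       s += a[i];
>>>     }
>>>     return s;
>>>   }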
>>> 
>>> Some future work (in no particular order):
>>> 
>>> 1. Use VTTI instead of (or in addition to) CodeMetrics in order to
>>> get a more accurate estimate of the loop iteration cost.
>>> 
>>> 2. Use loop dependence analysis to inhibit prefetching of loads
>>> we've just recently accessed in previous iterations (and maybe
>>> nearby data?)
>>> 
>>> 3. Additional heuristics to limit prefetch insertion when we have
>>> smarter hardware (like on x86) that needs help only with
>>> more-difficult cases
>>> 
>>> 4. node->next prefetching for linked-list iteration (a sketch of
>>> what this might look like follows this list)
>>> 
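>>> Item 4 might look roughly like the following in C (the node type
>>> here is hypothetical; the idea is to issue the prefetch for the
>>> next node while the current one is still being processed):
>>> 
>>>   struct node { struct node *next; double payload; };
>>> 
>>>   double walk(const struct node *head) {
>>>     double s = 0.0;
>>>     for (const struct node *p = head; p; p = p->next) {
>>>       /* Prefetch the next node one step ahead; prefetching a
>>>          null pointer is a harmless no-op. */
>>>       __builtin_prefetch(p->next, /*rw=*/0, /*locality=*/3);
>>>       s += p->payload;
>>>     }
>>>     return s;
>>>   }
>>> 
>>> This only hides one node's worth of latency; prefetching further
>>> ahead would require chasing the pointer chain, which reintroduces
>>> the very serialization we are trying to hide.
>>> 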
>>> In short, I'm sending this e-mail as (hopefully) a
>>> conversation-starter. As is, the pass is quite useful for me, and
>>> I'd like to know what kinds of things need to happen to make it
>>> useful more generally. I have only a very basic idea of what this
>>> means for smarter hardware and out-of-order cores, so feedback
>>> is certainly welcome.
>>> 
>>> That having been said, I'd like to commit this to trunk (turned off
>>> by default). As a side note, gcc has -fprefetch-loop-arrays, and
>>> we could similarly add a Clang flag to enable this pass.
>>> 
>>> Thanks again,
>>> Hal
>>> 
>>> Some good references are:
>>> [1] Memory part 5: What programmers can do.
>>> Ulrich Drepper, 2007.
>>> http://lwn.net/Articles/255364/
>>> 
>>> [2] When Prefetching Works, When It Doesn’t, and Why
>>> Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
>>> http://vuduc.org/pubs/lee2012-taco.pdf
>>> 
>>> P.S. It occurs to me that this probably won't apply against today's
>>> trunk because of the header renaming, but I can post a rebased
>>> patch soon.
>>> --
>>> Hal Finkel
>>> Postdoctoral Appointee
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>> <llvm-ldp.patch>
>> 
>> 
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory




