[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Hal Finkel hfinkel at anl.gov
Tue Jan 29 12:39:39 PST 2013


----- Original Message -----
> From: "Evan Cheng" <evan.cheng at apple.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Monday, January 7, 2013 2:37:26 PM
> Subject: Re: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> 
> 
> On Jan 5, 2013, at 5:01 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > ----- Original Message -----
> >> From: "Evan Cheng" <evan.cheng at apple.com>
> >> To: "Hal Finkel" <hfinkel at anl.gov>
> >> Cc: "l >> Commit Messages and Patches for LLVM"
> >> <llvm-commits at cs.uiuc.edu>
> >> Sent: Friday, January 4, 2013 6:49:16 PM
> >> Subject: Re: [llvm-commits] [PATCH/RFC] Loop Data Software
> >> Prefetching
> >> 
> >> Thanks for working on this. SW prefetching can introduce some
> >> overhead that might be difficult to estimate in an LLVM IR pass.
> > 
> > Agreed. For one thing, the extra addressing calculations can
> > increase register pressure in some cases. It is also hard to
> > estimate the number of cycles it will take to execute the loop
> > body (although our new TTI/CostModel infrastructure should help
> > somewhat).
> > 
> >> Have you considered implementing an MI pass so it can utilize
> >> analysis such as MachineTraceMetrics?
> > 
> > I did not consider it because I did not know it existed ;) The
> > critical path calculation looks like it would be very useful for
> > computing the prefetch distance. We could construct an MI-level
> > pass by creating a TII callback for adding prefetches and making
> > use of the MMOs with SE (and maybe the new dependence analysis) to
> > get higher-level loop information.
> > 
> > Nevertheless, I'm not sure whether the extra accuracy gained
> > from operating at the MI level will really help because the
> > prefetch distances are generally only rough estimates (being off
> > by a few hundred cycles might be okay) and the real work seems to
> > be in determining whether prefetching at all will help or hurt. It
> > might be easier to experiment with the basic heuristics at the IR
> > level and then move to the MI level if and when necessary. Also,
> > at the IR level the extra addressing calculations contribute to
> > LSR, get DAGCombined, etc. (I'm not yet sure how important those
> > things are either).
> > 
> > What do you suggest?
> 
> I agree that for experimental purposes what you have is fine. But I
> suspect you will need to move to an MI pass if you want something
> that works well across the board. I'm ok with the pass going in,
> just not enabled by default.

I've attached a rebased patch for review. Specifically, I have two issues that I'd like to discuss.

1. I'm unsatisfied with the TTI integration. I would like the target to be able to override the default values, and the user to still be able to specify alternate parameters. Currently, the base TTI implementation takes its parameters from the command-line options, but if a target overrides TTI, then the command-line options are ignored.

2. Applying this patch causes three regression test failures:
    LLVM :: CodeGen/X86/lsr-negative-stride.ll
    LLVM :: Transforms/LoopStrengthReduce/X86/2011-12-04-loserreg.ll
    LLVM :: Transforms/LoopStrengthReduce/X86/2012-01-13-phielim.ll

These failures are triggered merely by scheduling the pass before LSR runs, even though the pass does not change anything. Does anyone know how this might happen?
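
For issue 1, one possible arrangement is to resolve each parameter by precedence: a value the user gave explicitly wins, otherwise the target's override is used, otherwise the base default applies. The following is only a sketch of that precedence under assumed names (OptionalValue, resolveParam, and any numbers passed to them are hypothetical and are not taken from the patch). In LLVM itself, cl::opt's getNumOccurrences() is one way to tell whether an option was actually given on the command line.

```cpp
// Hypothetical helper type; none of these names come from the patch.
struct OptionalValue {
  bool Specified; // true only if a value was explicitly provided
  unsigned Value; // meaningful only when Specified is true
};

// Precedence: explicit user value, then target override, then base default.
inline unsigned resolveParam(unsigned BaseDefault, OptionalValue Target,
                             OptionalValue User) {
  if (User.Specified)
    return User.Value;
  if (Target.Specified)
    return Target.Value;
  return BaseDefault;
}
```

With this shape, a target that overrides TTI would supply only the Target argument, and the command-line value would still take effect whenever the user passed one.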

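For readers new to the thread, the heuristic described in the original message quoted below (how many iterations ahead to prefetch, and skipping loads that fall within one cache line of an already-prefetched load) can be sketched roughly as follows. This is an illustration, not code from the patch: the function names are hypothetical, and rounding up, clamping to at least one iteration, and comparing constant byte offsets from a common base are all assumptions.

```cpp
#include <cassert>
#include <vector>

// Iterations ahead = ceil(prefetch latency / estimated cycles per iteration),
// clamped to at least 1. The rounding and the clamp are assumptions.
inline unsigned prefetchItersAhead(unsigned PrefetchLatencyCycles,
                                   unsigned LoopBodyCycles) {
  assert(LoopBodyCycles > 0 && "loop body cost estimate must be nonzero");
  unsigned Iters =
      (PrefetchLatencyCycles + LoopBodyCycles - 1) / LoopBodyCycles;
  return Iters == 0 ? 1 : Iters;
}

// Returns true if Offset lies within one cache line of an offset that already
// has a prefetch, so a second prefetch would touch the same or adjacent line.
// Offsets are byte offsets from a common base (a simplifying assumption).
inline bool coveredByEarlierPrefetch(const std::vector<long long> &Prefetched,
                                     long long Offset,
                                     long long CacheLineBytes) {
  for (long long P : Prefetched) {
    long long Delta = Offset > P ? Offset - P : P - Offset;
    if (Delta < CacheLineBytes)
      return true;
  }
  return false;
}
```

For example, with an assumed latency of 300 cycles and a loop body estimated at 20 cycles, prefetchItersAhead returns 15, so the load for iteration i would be accompanied by a prefetch of the address used in iteration i + 15.
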
Thanks again,
Hal

> 
> Thanks,
> 
> Evan
> 
> > 
> > Thanks again,
> > Hal
> > 
> >> 
> >> Evan
> >> 
> >> On Jan 2, 2013, at 3:16 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> 
> >>> Hi again,
> >>> 
> >>> I've started working on a transformation pass to add explicit
> >>> prefetch instructions to prefetch loop data into the L1 cache.
> >>> The usefulness of this kind of transformation is obviously
> >>> target dependent (and, moreover, often dependent on how the
> >>> hardware prefetcher(s) operate). Nevertheless, there are some
> >>> motivational factors common to many platforms:
> >>> 
> >>> 1. On many platforms, to prevent L1 cache pollution, the hardware
> >>> prefetcher does not prefetch into the L1 cache (or, if it does,
> >>> it is not aggressive enough to achieve maximal performance).
> >>> This is true on some of my PowerPC hardware, and also true on
> >>> some modern x86 hardware (such as Nehalem and Barcelona [2]).
> >>> The hardware prefetchers will prefetch only into L2 (or L3,
> >>> etc.) but not into L1, and so explicit software prefetching is
> >>> the only way to pre-load data into the L1 cache. While it is
> >>> often true that out-of-order cores can hide the latency of L1
> >>> misses, explicit prefetching can still sometimes help.
> >>> 
> >>> 2. Software prefetching is useful for short streams (most
> >>> prefetchers require at least 2 misses to establish a prefetching
> >>> stream).
> >>> 
> >>> 3. Software prefetching is useful for irregular (but predictable)
> >>> data access patterns.
> >>> 
> >>> 4. Software prefetching is useful when prefetching all of the
> >>> necessary data for a loop would exceed the number of streams that
> >>> the hardware can handle. The number of streams is often fairly
> >>> limited (~8-32), and the streams are often shared between cores
> >>> (and hardware threads) for upper-level caches. In cases where a
> >>> large number of streams would be needed, software prefetching may
> >>> be the only way to prefetch data for the loop.
> >>> 
> >>> 5. Hardware prefetching often cannot cross page boundaries [1],
> >>> and so software prefetching is necessary to prevent misses on
> >>> page boundaries (and pages can be ~4 kB on many systems).
> >>> 
> >>> The initial version of this pass is fairly simple. It uses
> >>> CodeMetrics to estimate the number of cycles necessary to execute
> >>> the loop body, and divides a heuristic prefetch latency by that
> >>> estimate to calculate how many loop iterations ahead to prefetch
> >>> data. It then inserts prefetch instructions after every load
> >>> (but not for loads within one cache line size of some
> >>> already-prefetched load, to avoid double-prefetching cache
> >>> lines). This is fairly effective on my PowerPC hardware and
> >>> (somewhat to my surprise) is sometimes beneficial on my x86 test
> >>> machine. To be clear, using this pass often produces slowdowns
> >>> on my Xeon testing system (more often than speedups), so it
> >>> would certainly need some work to be generally applicable. If
> >>> anyone is interested in working on this with me, please let me
> >>> know.
> >>> 
> >>> Some future work (in no particular order):
> >>> 
> >>> 1. Use VTTI instead of (or in addition to) CodeMetrics in order
> >>> to get a more-accurate estimate of the loop iteration cost.
> >>> 
> >>> 2. Use loop dependence analysis to inhibit prefetching of data
> >>> we've just recently accessed in previous iterations (and maybe
> >>> nearby data?)
> >>> 
> >>> 3. Additional heuristics to limit prefetch insertion when we have
> >>> smarter hardware (like on x86) that needs help only with
> >>> more-difficult cases
> >>> 
> >>> 4. node->next prefetching for linked-list iteration
> >>> 
> >>> In short, I'm sending this e-mail as (hopefully) a
> >>> conversation-starter. As is, the pass is quite useful for me, and
> >>> I'd like to know what kinds of things need to happen to make it
> >>> useful more generally. I have only a very basic idea of what this
> >>> means for smarter hardware and out-of-order cores, so feedback
> >>> is certainly welcome.
> >>> 
> >>> That having been said, I'd like to commit this to trunk (turned
> >>> off by default). As a side note, gcc has -fprefetch-loop-arrays,
> >>> and we could similarly add a Clang flag to enable this pass.
> >>> 
> >>> Thanks again,
> >>> Hal
> >>> 
> >>> Some good references are:
> >>> [1] Memory part 5: What programmers can do.
> >>> Ulrich Drepper, 2007.
> >>> http://lwn.net/Articles/255364/
> >>> 
> >>> [2] When Prefetching Works, When It Doesn’t, and Why
> >>> Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
> >>> http://vuduc.org/pubs/lee2012-taco.pdf
> >>> 
> >>> P.S. It occurs to me that this probably won't apply against
> >>> today's trunk because of the header renaming, but I can post a
> >>> rebased patch soon.
> >>> --
> >>> Hal Finkel
> >>> Postdoctoral Appointee
> >>> Leadership Computing Facility
> >>> Argonne National Laboratory
> >>> <llvm-ldp.patch>
> >> 
> >> 
> > 
> 
> 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: loopdatapref.patch
Type: text/x-patch
Size: 17208 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130129/9bdf9a84/attachment.bin>
