[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Sergei Larin slarin at codeaurora.org
Mon Jan 7 09:44:17 PST 2013


Hal,

>... For instance - there is only one step
> >   from prefetching to cache line invalidation (once you know that a
> >   loop writes enough data).
> 
> Yes. What exactly do you have in mind?

   Think about using the dcbz (Data Cache Block set to Zero) instruction in a memcpy-like scenario: once you know you are going to _write_ enough data to fill an entire cache block/line, and you know you are not _reading_ from the same target location, you can zero the block directly in the cache and avoid the redundant fetch of its old contents from memory.
   If I remember right, this trick works great on some PPC systems :)
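
   To make this concrete, here is a minimal sketch of the idiom (my
example, not part of the patch; the 32-byte block size and the
alignment/size preconditions are assumptions, and the block size varies
by core):

  #include <stdint.h>
  #include <string.h>

  #define CACHE_BLOCK 32  /* assumed PPC cache block size; core-specific */

  /* Establish the cache block containing p as zeros, without fetching
     its old contents from memory (PowerPC dcbz). */
  static inline void dcbz_block(void *p) {
      __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
  }

  /* memcpy-like copy; assumes dst is block-aligned, n is a multiple of
     CACHE_BLOCK, and src/dst do not overlap. */
  void block_copy(uint8_t *dst, const uint8_t *src, size_t n) {
      for (size_t i = 0; i < n; i += CACHE_BLOCK) {
          dcbz_block(dst + i);      /* claim the line without a read */
          memcpy(dst + i, src + i, CACHE_BLOCK);
      }
  }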

Sergei

---
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation


> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Saturday, January 05, 2013 6:12 PM
> To: Sergei Larin
> Cc: Commit Messages and Patches for LLVM <llvm-commits at cs.uiuc.edu>
> Subject: Re: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> 
> ----- Original Message -----
> > From: "Sergei Larin" <slarin at codeaurora.org>
> > To: "Hal Finkel" <hfinkel at anl.gov>, "l >> Commit Messages and Patches
> > for LLVM" <llvm-commits at cs.uiuc.edu>
> > Sent: Thursday, January 3, 2013 11:39:20 AM
> > Subject: RE: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> >
> > Hal,
> >
> >   Thank you for raising this topic. I would love to see this
> >   implemented and extended. For instance - there is only one step
> >   from prefetching to cache line invalidation (once you know that a
> >   loop writes enough data).
> 
> Yes. What exactly do you have in mind? One thing that I'd like to add
> is a loop "cache-volume" calculation, combined with loop dependence
> analysis, so that we can estimate whether previously-loaded data will
> still be in cache when it is accessed again. For example, depending on
> the size of the cache, for a[i] = a[i+n] + a[i+n+m], we may or may not
> need to prefetch the first load.
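>
> To put rough numbers on it (illustrative only): with 8-byte doubles,
> the a[i+n] load at iteration i hits the address that the a[i+n+m] load
> touched m iterations earlier, so on the order of m*8 bytes of a[] pass
> through the cache in between. With m = 1000 that is ~8 kB, which fits
> easily in a 32 kB L1, so prefetching the first load is probably wasted;
> with m = 10^6 it is ~8 MB, and the line will long since have been
> evicted.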
> 
> >  I would also like to see how it can be
> >   integrated with SW pipelining on targets with static scheduling
> >   for improved determinism.
> >
> >   In short, I am +1 on "I'd like to commit this to trunk (turned off
> >   by default)."
> 
> Okay, thanks! I'll post some rebased patches soon.
> 
>  -Hal
> 
> >
> > Sergei
> >
> > ---
> > Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> > hosted by The Linux Foundation
> >
> > > -----Original Message-----
> > > From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> > > bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
> > > Sent: Wednesday, January 02, 2013 5:17 PM
> > > To: Commit Messages and Patches for LLVM <llvm-commits at cs.uiuc.edu>
> > > Subject: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> > >
> > > Hi again,
> > >
> > > I've started working on a transformation pass to add explicit
> > > prefetch instructions to prefetch loop data into the L1 cache. The
> > > usefulness of this kind of transformation is obviously target
> > > dependent (and, moreover, often dependent on how the hardware
> > > prefetcher(s) operate).
> > > Nevertheless, there are some motivational factors common to many
> > > platforms:
> > >
> > > 1. On many platforms, to prevent L1 cache pollution, the hardware
> > > prefetcher does not prefetch into the L1 cache (or, if it does, it
> > > is not aggressive enough to achieve maximal performance). This is
> > > true on some of my PowerPC hardware, and also true on some modern
> > > x86 hardware (such as Nehalem and Barcelona [2]). The hardware
> > > prefetchers will prefetch only into L2 (or L3, etc.) but not into
> > > L1, and so explicit software prefetching is the only way to pre-load
> > > data into the L1 cache. While it is often true that out-of-order
> > > cores can hide the latency of L1 misses, explicit prefetching can
> > > still sometimes help.
> > >
> > > 2. Software prefetching is useful for short streams (most
> > > prefetchers require at least 2 misses to establish a prefetching
> > > stream).
> > >
> > > 3. Software prefetching is useful for irregular (but predictable)
> > > data access patterns.
> > >
> > > 4. Software prefetching is useful when prefetching all of the
> > > necessary data for a loop would exceed the number of streams that
> > > the hardware can handle. The number of streams is often fairly
> > > limited (~8-32), and the streams are often shared between cores (and
> > > hardware threads) for upper-level caches. In cases where a large
> > > number of streams would be needed, software prefetching may be the
> > > only way to prefetch data for the loop.
> > >
> > > 5. Hardware prefetching often cannot cross page boundaries [1], and
> > > so software prefetching is necessary to prevent misses on page
> > > boundaries (and pages can be ~4 kB on many systems).
> > >
> > > The initial version of this pass is fairly simple. It uses
> > > CodeMetrics to estimate the number of cycles needed to execute the
> > > loop body, and divides a heuristic prefetch latency by that estimate
> > > to calculate how many loop iterations ahead to prefetch data. It
> > > then inserts a prefetch instruction after every load (skipping loads
> > > within one cache-line size of an already-prefetched load, to avoid
> > > prefetching the same cache line twice). This is fairly effective on
> > > my PowerPC hardware and, somewhat to my surprise, is sometimes
> > > beneficial on my x86 test machine. To be clear, this pass more often
> > > produces slowdowns than speedups on my Xeon test system, so it would
> > > certainly need some work to be generally applicable. If anyone is
> > > interested in working on this with me, please let me know.
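> > >
> > > To illustrate the intended effect at the source level (only a
> > > sketch, not pass output: the daxpy loop and the AHEAD value are
> > > made-up examples, and the real pass works on the IR):
> > >
> > >   #include <stddef.h>
> > >
> > >   /* Hypothetical iterations-ahead distance; the pass derives it as
> > >      roughly prefetch_latency / loop_body_cycles. */
> > >   #define AHEAD 8
> > >
> > >   void daxpy(double *y, const double *x, double a, size_t n) {
> > >       for (size_t i = 0; i < n; ++i) {
> > >           /* pull the data for iteration i+AHEAD into L1; prefetches
> > >              past the end of the array are harmless (no faults) */
> > >           __builtin_prefetch(&x[i + AHEAD], 0 /* read */, 3);
> > >           __builtin_prefetch(&y[i + AHEAD], 1 /* write */, 3);
> > >           y[i] = a * x[i] + y[i];
> > >       }
> > >   }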
> > >
> > > Some future work (in no particular order):
> > >
> > > 1. Use VTTI instead of (or in addition to) CodeMetrics in order to
> > > get a more-accurate estimate of the loop iteration cost.
> > >
> > > 2. Use loop dependence analysis to inhibit prefetching of loads
> > > we've just recently accessed in previous iterations (and maybe
> > > nearby data?)
> > >
> > > 3. Additional heuristics to limit prefetch insertion when we have
> > > smarter hardware (like on x86) that needs help only with
> > > more-difficult cases
> > >
> > > 4. node->next prefetching for linked-list iteration
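> > >
> > > For (4), the shape would be roughly the following (a hand-written
> > > sketch; it prefetches only one node ahead, since reaching deeper
> > > means chasing pointers we do not have yet):
> > >
> > >   struct node { struct node *next; int payload; };
> > >
> > >   int sum_list(const struct node *n) {
> > >       int s = 0;
> > >       while (n) {
> > >           /* start fetching the next node while this one is in use;
> > >              prefetching a NULL pointer cannot fault */
> > >           __builtin_prefetch(n->next, 0, 3);
> > >           s += n->payload;
> > >           n = n->next;
> > >       }
> > >       return s;
> > >   }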
> > >
> > > In short, I'm sending this e-mail as (hopefully) a conversation-
> > > starter. As is, the pass is quite useful for me, and I'd like to
> > > know what kinds of things need to happen to make it more generally
> > > useful. I have only a very basic idea of what this means for smarter
> > > hardware and out-of-order cores, so feedback is certainly welcome.
> > >
> > > That having been said, I'd like to commit this to trunk (turned off
> > > by default). As a side note, gcc has -fprefetch-loop-arrays, and we
> > > could similarly add a Clang flag to enable this pass.
> > >
> > > Thanks again,
> > > Hal
> > >
> > > Some good references are:
> > > [1] Memory part 5: What programmers can do.
> > > Ulrich Drepper, 2007.
> > > http://lwn.net/Articles/255364/
> > >
> > > [2] When Prefetching Works, When It Doesn’t, and Why.
> > > Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
> > > http://vuduc.org/pubs/lee2012-taco.pdf
> > >
> > > P.S. It occurs to me that this probably won't apply against today's
> > > trunk because of the header renaming, but I can post a rebased
> > > patch soon.
> > > --
> > > Hal Finkel
> > > Postdoctoral Appointee
> > > Leadership Computing Facility
> > > Argonne National Laboratory
> >
> >
> 
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory




