[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Hal Finkel hfinkel at anl.gov
Sat Jan 5 16:12:05 PST 2013


----- Original Message -----
> From: "Sergei Larin" <slarin at codeaurora.org>
> To: "Hal Finkel" <hfinkel at anl.gov>, "l >> Commit Messages and Patches for LLVM" <llvm-commits at cs.uiuc.edu>
> Sent: Thursday, January 3, 2013 11:39:20 AM
> Subject: RE: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> 
> Hal,
> 
>   Thank you for raising this topic. I would love to see this
>   implemented and extended. For instance - there is only one step
>   from prefetching to cache line invalidation (once you know that a
>   loop writes enough data).

Yes. What exactly do you have in mind? One thing that I'd like to add is a loop "cache-volume" calculation, combined with loop dependence analysis, so that we can estimate whether previously-loaded data will still be in cache when it is accessed again. For example, in a loop computing a[i] = a[i+n] + a[i+n+m], the first load re-reads data that the second load touched m iterations earlier, so depending on the size of the cache, we may or may not need to prefetch it.
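
To make that concrete, here is a minimal sketch of the reuse-distance
check I have in mind (the function name and the effective-cache-size
budget are hypothetical, not from the patch):

  #include <cstddef>

  // In a[i] = a[i+n] + a[i+n+m], the first load re-reads what the
  // second load touched m iterations earlier. If the bytes streamed
  // through the cache in those m iterations fit within an assumed
  // budget, the first load should hit and needs no prefetch.
  bool needsPrefetch(size_t ReuseDistIters, size_t BytesPerIter,
                     size_t EffectiveCacheBytes) {
    return ReuseDistIters * BytesPerIter > EffectiveCacheBytes;
  }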

>  I would also like to see how it can be
>   integrated with SW pipelining on targets with static scheduling
>   for improved determinism.
> 
>   In short, I am +1 on "I'd like to commit this to trunk (turned off
>   by default)."

Okay, thanks! I'll post some rebased patches soon.

 -Hal

> 
> Sergei
> 
> ---
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> hosted by The Linux Foundation
> 
> > -----Original Message-----
> > From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> > bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
> > Sent: Wednesday, January 02, 2013 5:17 PM
> > To: Commit Messages and Patches for LLVM
> > Subject: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> > 
> > Hi again,
> > 
> > I've started working on a transformation pass to add explicit
> > prefetch
> > instructions to prefetch loop data into the L1 cache. The
> > usefulness of
> > this kind of transformation is obviously target dependent (and,
> > moreover, often dependent on how the hardware prefetcher(s)
> > operate).
> > Nevertheless, there are some motivational factors common to many
> > platforms:
> > 
> > 1. On many platforms, to prevent L1 cache pollution, the hardware
> > prefetcher does not prefetch into the L1 cache (or, if it does, it
> > is
> > not aggressive enough to achieve maximal performance). This is true
> > on
> > some of my PowerPC hardware, and also true on some modern x86
> > hardware
> > (such as Nehalem and Barcelona [2]). The hardware prefetchers will
> > prefetch only into L2 (or L3, etc.) but not into L1, and so
> > explicit
> > software prefetching is the only way to pre-load data into the L1
> > cache. While it is often true that out-of-order cores can hide the
> > latency of L1 misses, explicit prefetching can still sometimes
> > help.
> > 
> > 2. Software prefetching is useful for short streams (most
> > prefetchers
> > require at least 2 misses to establish a prefetching stream).
> > 
> > 3. Software prefetching is useful for irregular (but predictable)
> > data
> > access patterns.
> > 
> > 4. Software prefetching is useful when prefetching all of the
> > necessary
> > data for a loop would exceed the number of streams that the
> > hardware
> > can handle. The number of streams is often fairly limited (~8-32),
> > and
> > the streams are often shared between cores (and hardware threads)
> > for
> > upper-level caches. In cases where a large number of streams would
> > be
> > needed, software prefetching may be the only way to prefetch data
> > for
> > the loop.
> > 
> > 5. Hardware prefetching often cannot cross page boundaries [1], and
> > so
> > software prefetching is necessary to prevent misses on page
> > boundaries
> > (and pages can be ~4 kB on many systems).
> > 
> > The initial version of this pass is fairly simple. It uses
> > CodeMetrics
> > to estimate the number of cycles necessary to execute the loop
> > body, and divides a heuristic prefetch latency by that estimate to
> > calculate how many loop iterations ahead to prefetch data. It then
> > inserts prefetch instructions after every load (skipping loads
> > within one cache-line size of an already-prefetched load, to avoid
> > double-prefetching cache lines). This is fairly effective on my
> > PowerPC hardware and, somewhat to my surprise, is sometimes
> > beneficial on my x86 test machine. To
> > be
> > clear, using this pass often produces slowdowns on my Xeon testing
> > system (more often than speedups), so it would certainly need some
> > work
> > to be generally applicable. If anyone is interested in working on
> > this
> > with me, please let me know.
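> > 
> > To make the heuristic concrete, here is a minimal source-level
> > sketch; the latency and cycle numbers are made-up placeholders
> > (in the pass, the body cost comes from CodeMetrics):
> > 
> >   #include <cstdint>
> > 
> >   // How many iterations ahead to prefetch so that a line arrives
> >   // roughly when the load that needs it executes (round up).
> >   unsigned iterationsAhead(unsigned PrefetchLatencyCycles,
> >                            unsigned LoopBodyCycles) {
> >     return (PrefetchLatencyCycles + LoopBodyCycles - 1) /
> >            LoopBodyCycles;
> >   }
> > 
> >   // Roughly what the transformed loop looks like in source form.
> >   void scale(const double *a, double *b, int64_t n) {
> >     const unsigned Ahead = iterationsAhead(300, 40); // assumed
> >     for (int64_t i = 0; i < n; ++i) {
> >       __builtin_prefetch(&a[i + Ahead]); // paired with the load below
> >       b[i] = 2.0 * a[i];
> >     }
> >   }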
> > 
> > Some future work (in no particular order):
> > 
> > 1. Use VTTI instead of (or in addition to) CodeMetrics in order to
> > get
> > a more-accurate estimate of the loop iteration cost.
> > 
> > 2. Use loop dependence analysis to inhibit prefetching of loads
> > we've
> > just recently accessed in previous iterations (and maybe nearby
> > data?)
> > 
> > 3. Additional heuristics to limit prefetch insertion when we have
> > smarter hardware (like on x86) that needs help only with
> > more-difficult
> > cases
> > 
> > 4. node->next prefetching for linked-list iteration (sketched below)
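> > 
> > For (4), a hedged sketch of the intended source-level effect (the
> > Node type here is hypothetical):
> > 
> >   struct Node { Node *next; int payload; };
> > 
> >   int sumList(const Node *n) {
> >     int s = 0;
> >     for (; n; n = n->next) {
> >       __builtin_prefetch(n->next); // start fetching the next node
> >       s += n->payload;             // while this one is processed
> >     }
> >     return s;
> >   }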
> > 
> > In short, I'm sending this e-mail as (hopefully) a conversation-
> > starter. As is, the pass is quite useful for me, and I'd like to
> > know
> > what kinds of things need to happen to make it useful more
> > generally. I
> > have only a very basic idea of what this means for smarter hardware
> > and
> > out-of-order cores, so feedback is certainly welcome.
> > 
> > That having been said, I'd like to commit this to trunk (turned off
> > by
> > default). As a side note, gcc has -fprefetch-loop-arrays, and we
> > could
> > similarly add a Clang flag to enable this pass.
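> > 
> > (For reference, gcc's version is enabled with, e.g.,
> > "gcc -O2 -fprefetch-loop-arrays foo.c".)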
> > 
> > Thanks again,
> > Hal
> > 
> > Some good references are:
> > [1] Memory part 5: What programmers can do.
> > Ulrich Drepper, 2007.
> > http://lwn.net/Articles/255364/
> > 
> > [2] When Prefetching Works, When It Doesn’t, and Why.
> > Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
> > http://vuduc.org/pubs/lee2012-taco.pdf
> > 
> > P.S. It occurs to me that this probably won't apply against today's
> > trunk because of the header renaming, but I can post a rebased
> > patch
> > soon.
> > --
> > Hal Finkel
> > Postdoctoral Appointee
> > Leadership Computing Facility
> > Argonne National Laboratory
> 
> 

-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory



