[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching

Sergei Larin slarin at codeaurora.org
Thu Jan 3 09:39:20 PST 2013


Hal,

  Thank you for raising this topic. I would love to see this implemented and extended. For instance, it is only one step from prefetching to cache-line invalidation (once you know that a loop writes enough data). I would also like to see how it could be integrated with software pipelining on targets with static scheduling, for improved determinism.

  In short, I am +1 on "I'd like to commit this to trunk (turned off by default)."

Sergei

---
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation

> -----Original Message-----
> From: llvm-commits-bounces at cs.uiuc.edu [mailto:llvm-commits-
> bounces at cs.uiuc.edu] On Behalf Of Hal Finkel
> Sent: Wednesday, January 02, 2013 5:17 PM
> To: Commit Messages and Patches for LLVM
> Subject: [llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
> 
> Hi again,
> 
> I've started working on a transformation pass to add explicit prefetch
> instructions to prefetch loop data into the L1 cache. The usefulness of
> this kind of transformation is obviously target dependent (and,
> moreover, often dependent on how the hardware prefetcher(s) operate).
> Nevertheless, there are some motivational factors common to many
> platforms:
> 
> 1. On many platforms, to prevent L1 cache pollution, the hardware
> prefetcher does not prefetch into the L1 cache (or, if it does, it is
> not aggressive enough to achieve maximal performance). This is true on
> some of my PowerPC hardware, and also true on some modern x86 hardware
> (such as Nehalem and Barcelona [2]). The hardware prefetchers will
> prefetch only into L2 (or L3, etc.) but not into L1, and so explicit
> software prefetching is the only way to pre-load data into the L1
> cache. While it is often true that out-of-order cores can hide the
> latency of L1 misses, explicit prefetching can still sometimes help.
> 
> 2. Software prefetching is useful for short streams (most prefetchers
> require at least 2 misses to establish a prefetching stream).
> 
> 3. Software prefetching is useful for irregular (but predictable) data
> access patterns.
> 
> 4. Software prefetching is useful when prefetching all of the necessary
> data for a loop would exceed the number of streams that the hardware
> can handle. The number of streams is often fairly limited (~8-32), and
> the streams are often shared between cores (and hardware threads) for
> upper-level caches. In cases where a large number of streams would be
> needed, software prefetching may be the only way to prefetch data for
> the loop.
> 
> 5. Hardware prefetching often cannot cross page boundaries [1], and so
> software prefetching is necessary to prevent misses on page boundaries
> (and pages can be ~4 kB on many systems).
> 
> The initial version of this pass is fairly simple. It uses CodeMetrics
> to estimate the number of cycles needed to execute the loop body, and
> divides a heuristic prefetch latency by that estimate to calculate how
> many loop iterations ahead to prefetch data. It then inserts a prefetch
> instruction after every load (skipping loads within one cache-line size
> of an already-prefetched load, to avoid double-prefetching the same
> line). This is fairly effective on my PowerPC hardware, and (somewhat
> to my surprise), is sometimes beneficial on my x86 test machine. To be
> clear, using this pass often produces slowdowns on my Xeon testing
> system (more often than speedups), so it would certainly need some work
> to be generally applicable. If anyone is interested in working on this
> with me, please let me know.
> 
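> To make the heuristic concrete, here is a source-level analogue of
> what the pass effectively does (illustrative C, not the pass itself;
> the latency and cost numbers are invented for the example):
> 
>   /* Assume a prefetch latency of ~300 cycles and a loop body that
>      CodeMetrics estimates at ~30 cycles, so prefetch 300/30 = 10
>      iterations ahead.  __builtin_prefetch is the GCC/Clang builtin;
>      the pass inserts the equivalent llvm.prefetch intrinsic.  */
>   enum { PREFETCH_DIST = 10 };
> 
>   void scale(double *restrict a, const double *restrict b, long n) {
>     for (long i = 0; i < n; ++i) {
>       /* Read hint (rw = 0), keep in all cache levels (locality = 3).
>          Prefetches are non-faulting, so reading past the end of b
>          is safe.  */
>       __builtin_prefetch(&b[i + PREFETCH_DIST], 0, 3);
>       a[i] = 2.0 * b[i];
>     }
>   }
> 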
> Some future work (in no particular order):
> 
> 1. Use VTTI instead of (or in addition to) CodeMetrics in order to get
> a more-accurate estimate of the loop iteration cost.
> 
> 2. Use loop dependence analysis to inhibit prefetching of loads we've
> just recently accessed in previous iterations (and maybe nearby data?)
> 
> 3. Additional heuristics to limit prefetch insertion when we have
> smarter hardware (like on x86) that needs help only with more-difficult
> cases
> 
> 4. node->next prefetching for linked-list iteration (sketch below)
> 
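> As an illustration of item 4 (again hand-written C, not part of the
> patch), the idea is to issue the fetch for the successor node before
> doing the work on the current one, so the pointer-chase miss overlaps
> useful work:
> 
>   struct node { struct node *next; int payload; };
> 
>   int sum_list(const struct node *n) {
>     int sum = 0;
>     while (n) {
>       /* Non-faulting, so prefetching a NULL next pointer is fine. */
>       __builtin_prefetch(n->next, 0, 3);
>       sum += n->payload;
>       n = n->next;
>     }
>     return sum;
>   }
> 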
> In short, I'm sending this e-mail as (hopefully) a conversation
> starter. As is, the pass is quite useful for me, and I'd like to know
> what kinds of things need to happen to make it useful more generally. I
> have only a very basic idea of what this means for smarter hardware and
> out-of-order cores, so feedback is certainly welcome.
> 
> That having been said, I'd like to commit this to trunk (turned off by
> default). As a side note, gcc has -fprefetch-loop-arrays, and we could
> similarly add a Clang flag to enable this pass.
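> 
> (For reference, the gcc invocation is something like
> 
>   gcc -O3 -fprefetch-loop-arrays foo.c
> 
> while the Clang spelling would still need to be decided.)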
> 
> Thanks again,
> Hal
> 
> Some good references are:
> [1] Memory part 5: What programmers can do.
> Ulrich Drepper, 2007.
> http://lwn.net/Articles/255364/
> 
> [2] When Prefetching Works, When It Doesn’t, and Why.
> Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
> http://vuduc.org/pubs/lee2012-taco.pdf
> 
> P.S. It occurs to me that this probably won't apply against today's
> trunk because of the header renaming, but I can post a rebased patch
> soon.
> --
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory
