[llvm-commits] [PATCH/RFC] Loop Data Software Prefetching
hfinkel at anl.gov
Wed Jan 2 15:16:51 PST 2013
I've started working on a transformation pass to add explicit prefetch instructions to prefetch loop data into the L1 cache. The usefulness of this kind of transformation is obviously target dependent (and, moreover, often dependent on how the hardware prefetcher(s) operate). Nevertheless, there are some motivational factors common to many platforms:
1. On many platforms, to prevent L1 cache pollution, the hardware prefetcher does not prefetch into the L1 cache (or, if it does, it is not aggressive enough to achieve maximal performance). This is true on some of my PowerPC hardware, and also true on some modern x86 hardware (such as Nehalem and Barcelona). The hardware prefetchers will prefetch only into L2 (or L3, etc.) but not into L1, and so explicit software prefetching is the only way to pre-load data into the L1 cache. While it is often true that out-of-order cores can hide the latency of L1 misses, explicit prefetching can still sometimes help.
2. Software prefetching is useful for short streams (most prefetchers require at least 2 misses to establish a prefetching stream).
3. Software prefetching is useful for irregular (but predictable) data access patterns.
4. Software prefetching is useful when prefetching all of the necessary data for a loop would exceed the number of streams that the hardware can handle. The number of streams is often fairly limited (~8-32), and the streams are often shared between cores (and hardware threads) for upper-level caches. In cases where a large number of streams would be needed, software prefetching may be the only way to prefetch data for the loop.
5. Hardware prefetching often cannot cross page boundaries, and so software prefetching is necessary to prevent misses on page boundaries (and pages can be ~4 kB on many systems).
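As a rough source-level illustration of point 3 (this is not code from the patch), here is a minimal sketch of an indirect access pattern that a stride-based hardware prefetcher generally cannot follow, but where the future address is known from an index array. The function name gatherSum and the tuning parameter ahead are hypothetical; __builtin_prefetch is the GCC/Clang builtin, which Clang lowers to the llvm.prefetch intrinsic.

```cpp
#include <cstddef>
#include <vector>

// Sums data[index[i]] over all i. The address needed `ahead` iterations
// from now is known from the index array, so it can be requested early.
// `ahead` is a hypothetical tuning knob, not a value taken from the patch.
// Every entry of `index` must be a valid position in `data`.
double gatherSum(const std::vector<double> &data,
                 const std::vector<std::size_t> &index,
                 std::size_t ahead) {
  double sum = 0.0;
  const std::size_t n = index.size();
  for (std::size_t i = 0; i < n; ++i) {
    if (i + ahead < n) {
      // Second argument 0 = prefetch for read; third argument 3 = high
      // temporal locality (keep the line in all cache levels).
      __builtin_prefetch(&data[index[i + ahead]], 0, 3);
    }
    sum += data[index[i]];
  }
  return sum;
}
```

The prefetch is only a hint and does not change the result; whether it helps depends on the target and on how far ahead the request is issued.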
The initial version of this pass is fairly simple. It uses CodeMetrics to estimate the number of cycles necessary to execute the loop body, and divides a heuristic prefetch latency by that estimate to calculate how many loop iterations ahead to prefetch data. It then inserts a prefetch instruction after every load (except for loads within one cache line of some already-prefetched load, to avoid double-prefetching cache lines). This is fairly effective on my PowerPC hardware and, somewhat to my surprise, is sometimes beneficial on my x86 test machine. To be clear, using this pass often produces slowdowns on my Xeon testing system (more often than speedups), so it would certainly need some work to be generally applicable. If anyone is interested in working on this with me, please let me know.
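The two decisions described above can be sketched in isolation. This is a simplified standalone model, not the code in the attached patch: the helper names itersAhead and selectPrefetches are hypothetical, loads are represented only by their byte offsets from a common base, and the iteration count is assumed to be the latency divided by the per-iteration cycle estimate, clamped to at least one.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of iterations ahead to prefetch, taken as the
// assumed prefetch latency divided by the estimated cycles per iteration.
unsigned itersAhead(unsigned latencyCycles, unsigned loopBodyCycles) {
  if (loopBodyCycles == 0)
    loopBodyCycles = 1;  // guard against a zero cost estimate
  return std::max(1u, latencyCycles / loopBodyCycles);
}

// Hypothetical helper: given the byte offsets of the loads in the loop body,
// keep only those that are at least one cache line away from every load
// already chosen, so that no cache line is prefetched twice.
std::vector<std::int64_t>
selectPrefetches(const std::vector<std::int64_t> &loadOffsets,
                 std::int64_t cacheLineBytes) {
  std::vector<std::int64_t> chosen;
  for (std::int64_t off : loadOffsets) {
    bool covered = false;
    for (std::int64_t prev : chosen) {
      const std::int64_t delta = off > prev ? off - prev : prev - off;
      if (delta < cacheLineBytes) {
        covered = true;
        break;
      }
    }
    if (!covered)
      chosen.push_back(off);
  }
  return chosen;
}
```

For instance, with an assumed latency of 300 cycles and a loop body estimated at 20 cycles, this model prefetches 15 iterations ahead; a loop body costlier than the latency still prefetches one iteration ahead.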
Some future work (in no particular order):
1. Use VTTI instead of (or in addition to) CodeMetrics in order to get a more-accurate estimate of the loop iteration cost.
2. Use loop dependence analysis to inhibit prefetching of loads we've just recently accessed in previous iterations (and maybe nearby data?)
3. Additional heuristics to limit prefetch insertion when we have smarter hardware (like on x86) that needs help only with more-difficult cases
4. node->next prefetching for linked-list iteration
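For item 4, a minimal source-level sketch of what node->next prefetching could look like is given below. The Node type and sumList function are hypothetical and only illustrate the idea; nothing like this exists in the current patch. __builtin_prefetch is the GCC/Clang builtin.

```cpp
// Hypothetical singly linked list node.
struct Node {
  int value;
  Node *next;
};

// Walks the list and requests the next node before working on the current
// one, so that the pointer-chasing miss overlaps with the loop body.
long sumList(const Node *head) {
  long sum = 0;
  for (const Node *n = head; n != nullptr; n = n->next) {
    if (n->next != nullptr) {
      // Second argument 0 = prefetch for read; third argument 3 = high
      // temporal locality.
      __builtin_prefetch(n->next, 0, 3);
    }
    sum += n->value;
  }
  return sum;
}
```

With a loop body this small the prefetch has little time to complete before the next node is needed, so any benefit would be expected mainly when each iteration does substantially more work.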
In short, I'm sending this e-mail as (hopefully) a conversation-starter. As is, the pass is quite useful for me, and I'd like to know what kinds of things need to happen to make it useful more generally. I have only a very basic idea of what this means for smarter hardware and out-of-order cores, so feedback is certainly welcome.
That having been said, I'd like to commit this to trunk (turned off by default). As a side note, gcc has -fprefetch-loop-arrays, and we could similarly add a Clang flag to enable this pass.
Some good references are:
 Memory part 5: What programmers can do.
Ulrich Drepper, 2007.
 When Prefetching Works, When It Doesn’t, and Why
Jaekyu Lee, Hyesoon Kim, and Richard Vuduc, 2012.
P.S. It occurs to me that this probably won't apply against today's trunk because of the header renaming, but I can post a rebased patch soon.
Leadership Computing Facility
Argonne National Laboratory