[LLVMdev] Adjusting Load Latencies

Andrew Trick atrick at apple.com
Fri Mar 2 13:49:48 PST 2012


On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:

> Hello,
> 
> I am interested in writing an analysis pass that looks at the stride
> used for loads in a loop and passes that information down so that it
> can be used by the instruction scheduler. The reason is that if the
> load stride is greater than the cache line size, then I would expect
> the load to always miss the cache, and, as a result, the scheduler
> should use a much larger effective latency when scheduling the load and
> its dependencies. Cache-miss metadata might also be a good supplemental
> option. I can add methods to TLI that can convert the access stride
> information into effective latency information, but what is the best
> way to annotate the loads so that the information will be available to
> the SDNodes?
> 
> Has anyone tried something like this before?
> 
> A related issue is automatically adding prefetching to loops. The
> trick here is to accurately estimate the number of cycles the loop
> body will take to execute (so that you prefetch the correct amount
> ahead). This information is not really available until instruction
> scheduling, and so prefetch adding cannot really complete until just
> before MC generation (the prefetch instructions can be scheduled, but
> their constant offset needs to be held free for a while). In addition,
> estimating the number of cycles also requires relatively accurate
> load/store latencies, and this, in turn, requires cache-miss latencies
> to be accounted for (which must then account for the prefetches).
> 
> If anyone has thoughts on these ideas, I would like to hear them.
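[Editor's note: the stride heuristic described in the question could be sketched as follows. All latency and cache-line numbers are illustrative placeholders, not real target data, and `effective_load_latency` is a hypothetical helper, not an existing LLVM/TLI hook.]

```c
/* Sketch of the heuristic above: if a load's stride meets or exceeds
 * the cache line size, every access touches a new line, so assume it
 * always misses and charge the miss latency instead of the hit latency.
 * All numbers are made-up placeholders. */
#include <assert.h>

enum {
    CACHE_LINE_SIZE = 64, /* bytes (assumed) */
    L1_HIT_LATENCY  = 3,  /* cycles (placeholder) */
    MISS_LATENCY    = 30  /* cycles (placeholder) */
};

/* Latency the scheduler would assume for a load with this stride. */
static int effective_load_latency(int stride_bytes)
{
    if (stride_bytes >= CACHE_LINE_SIZE)
        return MISS_LATENCY;  /* each access lands on a fresh line */
    return L1_HIT_LATENCY;    /* consecutive accesses share a line */
}
```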


If you annotate loads with their expected latency, the upcoming MachineScheduler will be able to use the information. In the short term (next couple months), you're free to hack the SDScheduler as well.

Although the scheduler can use the information, I don't think it can do much good with it when scheduling for mainstream targets. It would be more interesting when scheduling for an in-order machine without a hardware prefetch unit.

An acyclic instruction scheduler can schedule for L1 and L2 latency at most. But the out-of-order engine should be able to compensate for these latencies.

L2 misses within a high trip count loop will benefit greatly from stride prefetching. But regular strides should already be handled in hardware.

So my suggestions are:

1) If you have an in-order machine, your workload actually fits in L2, and you care deeply about every stall cycle, it may be useful for the scheduler to distinguish between expected L1 vs L2 latency. Try to issue multiple L1 misses in parallel or cover their latency. You can consider offsets from aligned objects in addition to induction variable strides.
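[Editor's note: a minimal source-level sketch of "issue multiple L1 misses in parallel". The loop below keeps four independent load streams live, so a latency-aware scheduler (or an in-order core) can have several misses in flight at once instead of serializing them behind one accumulator. The function name and unroll factor are illustrative.]

```c
/* Four independent accumulators mean the four loads per iteration have
 * no dependence on each other; their cache misses can overlap. */
#include <stddef.h>

static long sum4(const long *a, size_t n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];        /* independent load stream 0 */
        s1 += a[i + 1];    /* independent load stream 1 */
        s2 += a[i + 2];    /* independent load stream 2 */
        s3 += a[i + 3];    /* independent load stream 3 */
    }
    long s = s0 + s1 + s2 + s3;
    for (; i < n; i++)     /* scalar tail */
        s += a[i];
    return s;
}
```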

2) If you have a machine without hardware prefetching, you really need to insert prefetches. This is a much bigger bang for the buck than scheduling for L1/L2 latency. To cover the latency of an L2 miss, which is what really matters, you need to prefetch many iterations ahead. Rather than trying to predict the number of cycles each iteration takes, you're better off prefetching as many iterations ahead as possible, up to the hardware's limit on outstanding loads. If the loop has a constant trip count, you can probably do something clever. Otherwise I think branch profiling could help by telling you which loops have a very high trip count. I think prefetch insertion is more closely tied to loop unrolling than instruction scheduling.
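[Editor's note: a hedged sketch of what inserted prefetches might look like at the source level, using the real GCC/Clang `__builtin_prefetch` builtin. The prefetch distance below is a made-up placeholder; per the advice above, a real pass would push it as far ahead as the hardware's outstanding-load limit allows rather than computing it from cycle estimates.]

```c
/* Prefetch a fixed number of iterations ahead of the current load.
 * PREFETCH_DIST is an illustrative placeholder, not a tuned value. */
#include <stddef.h>

#define PREFETCH_DIST 16  /* iterations ahead (assumed) */

static long sum_with_prefetch(const long *a, size_t n)
{
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n)
            /* rw=0: prefetch for read; locality=3: keep in all caches */
            __builtin_prefetch(&a[i + PREFETCH_DIST], 0, 3);
        s += a[i];
    }
    return s;
}
```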

-Andy
