[LLVMdev] Adjusting Load Latencies

Hal Finkel hfinkel at anl.gov
Fri Mar 2 14:59:13 PST 2012


On Fri, 02 Mar 2012 13:49:48 -0800
Andrew Trick <atrick at apple.com> wrote:

> On Mar 2, 2012, at 9:01 AM, Hal Finkel <hfinkel at anl.gov> wrote:
> 
> > Hello,
> > 
> > I am interested in writing an analysis pass that looks at the stride
> > used for loads in a loop and passes that information down so that it
> > can be used by the instruction scheduler. The reason is that if the
> > load stride is greater than the cache line size, then I would expect
> > the load to always miss the cache, and, as a result, the scheduler
> > should use a much larger effective latency when scheduling the load
> > and its dependencies. Cache-miss metadata might also be a good
> > supplemental option. I can add methods to TLI that can convert the
> > access stride information into effective latency information, but
> > what is the best way to annotate the loads so that the information
> > will be available to the SDNodes?
> > 
> > Has anyone tried something like this before?
> > 
> > A related issue is automatically adding prefetching to loops. The
> > trick here is to accurately estimate the number of cycles the loop
> > body will take the execute (so that you prefetch the correct amount
> > ahead). This information is not really available until instruction
> > scheduling, and so prefetch adding cannot really complete until just
> > before MC generation (the prefetch instructions can be scheduled,
> > but their constant offset needs to be held free for a while). In
> > addition, estimating the number of cycles also requires relatively
> > accurate load/store latencies, and this, in turn, requires
> > cache-miss latencies to be accounted for (which must then account
> > for the prefetches).
> > 
> > If anyone has thoughts on these ideas, I would like to hear them.
> 

Andy,

Thank you for writing such a detailed response.

> If you annotate loads with their expected latency, the upcoming
> MachineScheduler will be able to use the information. In the short
> term (next couple months), you're free to hack the SDScheduler as
> well.

Alright, sounds good. If I add metadata to the load, can I get to it
through the Value * in the associated MachineMemOperand object?
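To make the producer side concrete, here is a rough sketch of what I
have in mind (the metadata kind name "cache.miss.latency" and the TLI
hook getLoadLatencyForStride() are hypothetical; whether the metadata
is still reachable from the MachineMemOperand is exactly the question
above):

  // Tag a strided load with an expected-latency hint. The stride
  // comes from ScalarEvolution; the cycle count would come from the
  // proposed (hypothetical) TLI hook.
  void annotateLoad(LoadInst *LI, ScalarEvolution &SE,
                    const TargetLowering &TLI) {
    const SCEV *PtrSCEV = SE.getSCEV(LI->getPointerOperand());
    if (const SCEVAddRecExpr *AR = dyn_cast<SCEVAddRecExpr>(PtrSCEV))
      if (const SCEVConstant *Step =
              dyn_cast<SCEVConstant>(AR->getStepRecurrence(SE))) {
        int64_t Stride = Step->getValue()->getSExtValue();
        unsigned Cycles = TLI.getLoadLatencyForStride(Stride); // hypothetical
        LLVMContext &Ctx = LI->getContext();
        Value *CyclesVal = ConstantInt::get(Type::getInt32Ty(Ctx), Cycles);
        LI->setMetadata("cache.miss.latency", MDNode::get(Ctx, CyclesVal));
      }
  }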

> 
> Although the scheduler can use the information, I don't think it can
> do much good with it scheduling for mainstream targets. It would be
> more interesting when scheduling for an in-order machine without a
> hardware prefetch unit.
> 
> An acyclic instruction scheduler can schedule for L1 and L2 latency
> at most. But the out-of-order engine should be able to compensate for
> these latencies.
> 
> L2 misses within a high trip count loop will benefit greatly from
> stride prefetching. But regular strides should already be handled in
> hardware.

I agree. For the machine with which I'm working, the hardware prefetch
unit only works if you access N consecutive cache lines. Any pattern
that does not do that will need explicit prefetch instructions. Also, I
need to be careful not to prefetch too much because the request buffer
is fairly small (it handles < 10 outstanding requests).
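To put rough numbers on that (purely illustrative, not measured): if an
L2 miss costs ~200 cycles and one loop iteration takes ~25 cycles, then
covering the miss means issuing prefetches ceil(200/25) = 8 iterations
ahead, which by itself nearly fills a sub-10-entry request buffer.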

> 
> So my suggestions are:
> 
> 1) If you have an in-order machine, your workload actually fits in
> L2, and you care deeply about every stall cycle, it may be useful for
> the scheduler to distinguish between expected L1 vs L2 latency. Try to
> issue multiple L1 misses in parallel or cover their latency. You can
> consider offsets from aligned objects in addition to induction
> variable strides.

Issuing multiple L1 misses in parallel is exactly what I would like to
do. Looking at offsets from aligned objects is also a good idea.
 
> 
> 2) If you have a machine without hardware prefetching, you really
> need to insert prefetches. This is a much bigger bang for the buck
> than scheduling for L1/L2 latency. To cover the latency of an L2
> miss, which is what really matters, you need to prefetch many
> iterations ahead. Rather than trying to predict the number of cycles
> each iteration takes, you're better off prefetching as many
> iterations ahead as possible up to the hardware's limit on
> outstanding loads. If the loop has a constant trip count, you can
> probably do something clever. Otherwise I think branch profiling
> could help by telling you which loops have a very high trip count. I
> think prefetch insertion is more closely tied to loop unrolling than
> instruction scheduling.

I'm afraid that "as many iterations ahead as possible" may turn out to
be too few iterations: because the request buffer is small, starting at
the very next iteration caps how far ahead the in-flight prefetches can
reach. Nevertheless, it is certainly worth a try.
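For what it's worth, once a distance D is chosen the insertion itself
is simple; a rough sketch (D, the stride, and the constants are all
placeholders, and real code would recompute the address from SCEV
rather than a flat byte offset off the current pointer):

  // Insert a data prefetch D iterations ahead of a strided load,
  // where D would be min(request-buffer budget,
  //                      miss latency / est. cycles per iteration).
  void insertPrefetch(LoadInst *LI, int64_t StrideBytes, unsigned D) {
    IRBuilder<> B(LI);
    Module *M = LI->getParent()->getParent()->getParent();
    Value *Base = B.CreateBitCast(LI->getPointerOperand(),
                                  B.getInt8PtrTy());
    Value *Addr = B.CreateConstGEP1_64(Base, (int64_t)D * StrideBytes);
    Function *Pre = Intrinsic::getDeclaration(M, Intrinsic::prefetch);
    B.CreateCall4(Pre, Addr,
                  B.getInt32(0),   // rw: 0 = read
                  B.getInt32(3),   // locality: 3 = keep in cache
                  B.getInt32(1));  // 1 = data cache
  }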

Thanks again,
Hal

> 
> -Andy



-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory

