[PATCH] D24833: [LoopDataPrefetch/AArch64] Allow selective prefetching of symbolic strided accesses

Wed Sep 28 23:39:54 PDT 2016

bmakam added inline comments.

================
Comment at: lib/Target/AArch64/AArch64Subtarget.cpp:74-75
@@ -73,3 +73,4 @@
     PrefetchDistance = 740;
+    PrefetchDegree = 1;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 11;
----------------
anemet wrote:
> bmakam wrote:
> > anemet wrote:
> > > bmakam wrote:
> > > > anemet wrote:
> > > > > So you are saying on one hand (MinPrefetchStride = 1024) that we shouldn't bother prefetching unless the stride is at least 1K but then you say (PrefetchDegree = 1) that you want to prefetch the very next cache line anytime the stride is not known at compile time.
> > > > > 
> > > > > I feel that there is a contradiction here.  The former suggest that you have a pretty powerful HW prefetcher, the latter that you don't and willing to speculate aggressively to compensate for it.
> > > > > 
> > > > > It seems that something is wrong with the model here.
> > > > Thanks for the review, Adam.
> > > > 
> > > > You are right, If the stride is unknown at compile time we speculate and prefetch the next cache line and if the stride is known we do not need to speculate so we conservatively prefetch for strides > 1K. 
> > > > 
> > > > Sorry, I do not see a contradiction here. Are you suggesting to insert runtime checks to determine that the unknown stride is > 1K and only then prefetch? I just do not see a justification for the additional complexity and probably it might hurt performance due to the runtime checks.
> > > No I am not suggesting run-time checks.
> > > 
> > > Accesses with unknown strides are *only* omitted if MinPrefetchStrides is set (since we can't compare them at compile time).  It seems to me that you don't want to set MinPrefetchStride for your target.  Have you experimented with that?
> > > 
> > > To explain the contradiction a bit more, your model says that you *have* a HW perfetcher that is able to track regular strides < 1024.
> > > 
> > > Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024.
> > > 
> > > It seems to me that you don't want to set MinPrefetchStride for your target. Have you experimented with that?
> > We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3 x3 matrices using 16 bytes per element, so the total size of matrices is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.
> > 
> > > Your patch contradicts this by saying that even small regular strides are worth prefetching -- the next cache line corresponds to a stride that is way less than 1024
> > 
> > The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream
> > a, b, a+k, b+k, a+2k, b+2k.... The inner loop stride is k but a and b are cache line apart. In the first iteration of the outer loop we access a, a+k,a+2k and so on.. and in second iteration of outer loop we access b, b+k, b+2k and so on... With the next line prefetching we prefetch b, b+k, b+2k... whenever we access a, a+k, a+2k... respectively. This is the reason spec2006/mcf improves.
> > 
> > ---
> > [1] When Prefetching Works, When It Doesn’t, and Why. JAEKYU LEE et. al.
> > 
> > 
> > We do need MinPrefetchStride for Kryo as it seems to benefit spec2006/milc. Spec2006/milc is interesting because it exhibits short stream behavior. These streams are too short (3 x3 matrices using 16 bytes per element, so the total size of matrices is 144 bytes [1]) to train the hardware prefetcher. However, the current model catches these because the stride is > 1K.
> 
> The stride is milc is 2048 bytes but you don't need to set MinPrefetchStride to get that.  MinPrefetchStride is the minimum.  Even if you don't set it (i.e. = 0), it will cover the milc case.
> 
> > The next cache line prefetching need not necessarily correspond to a stride < 1K. Consider for example an interleaved stream
> > a, b, a+k, b+k, a+2k, b+2k.... The inner loop stride is k but a and b are cache line apart. In the first iteration of the outer loop we access a, a+k,a+2k and so on.. and in second iteration of outer loop we access b, b+k, b+2k and so on... With the next line prefetching we prefetch b, b+k, b+2k... whenever we access a, a+k, a+2k... respectively. This is the reason spec2006/mcf improves.
> 
> Possibly, but this is not the situation you're analyzing for.  You are really getting this by the chance of how you set up your heuristic.  
> The stride is milc is 2048 bytes but you don't need to set MinPrefetchStride to get that. MinPrefetchStride is the minimum. Even if you don't set it (i.e. = 0), it will cover the milc case.

Yes, but this heuristic only improves spec2006/milc.

> Possibly, but this is not the situation you're analyzing for. You are really getting this by the chance of how you set up your heuristic.

The chance that the data in the next cache line will be soon accessed is high when the access pattern is irregular (unknown compile time stride). This is the situation I am analyzing for and the chance seems pretty high as it improves several benchmarks not just spec2006/mcf.




https://reviews.llvm.org/D24833