[RFC] Heuristic for complete loop unrolling
Michael Zolotukhin
mzolotukhin at apple.com
Mon Jan 26 11:50:23 PST 2015
> On Jan 25, 2015, at 6:06 AM, Hal Finkel <hfinkel at anl.gov> wrote:
>
> ----- Original Message -----
>> From: "Michael Zolotukhin" <mzolotukhin at apple.com>
>> To: "Hal Finkel" <hfinkel at anl.gov>
>> Cc: "Arnold Schwaighofer" <aschwaighofer at apple.com>, "Commit Messages and Patches for LLVM"
>> <llvm-commits at cs.uiuc.edu>
>> Sent: Sunday, January 25, 2015 1:20:18 AM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>>
>>
>> On Jan 24, 2015, at 12:38 PM, Hal Finkel < hfinkel at anl.gov > wrote:
>>
>> ----- Original Message -----
>>
>>
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "Hal Finkel" < hfinkel at anl.gov >
>> Cc: "Arnold Schwaighofer" < aschwaighofer at apple.com >, "Commit
>> Messages and Patches for LLVM"
>> < llvm-commits at cs.uiuc.edu >
>> Sent: Saturday, January 24, 2015 2:26:03 PM
>> Subject: Re: [RFC] Heuristic for complete loop unrolling
>>
>>
>> Hi Hal,
>>
>>
>> Thanks for the review! Please see my comments inline.
>>
>> On Jan 24, 2015, at 6:44 AM, Hal Finkel < hfinkel at anl.gov > wrote:
>>
>> [moving patch review to llvm-commits]
>>
>> +static bool CanEliminateLoadFrom(Value *V) {
>> +  if (Constant *C = dyn_cast<Constant>(V)) {
>> +    if (GlobalVariable *GV = dyn_cast<GlobalVariable>(C))
>> +      if (GV->isConstant() && GV->hasDefinitiveInitializer())
>>
>> Why are you casting to a Constant, and then to a GlobalVariable, and
>> then checking GV->isConstant()? There seems to be unnecessary
>> redundancy here ;)
>> Indeed :)
>>
>> +        return GV->getInitializer();
>> +  }
>> +  return false;
>> +}
>>
>> +static unsigned ApproximateNumberOfEliminatedInstruction(const Loop *L,
>> +                                                         ScalarEvolution &SE) {
>>
>> This function seems mis-named. For one thing, it is not counting the
>> number of instructions potentially eliminated, but the number of
>> loads. Eliminating the loads might have lead to further constant
>> propagation, and really calculating the number of eliminated
>> instructions would require estimating that effect, right?
>> That’s right. But I’d rather change the function name, than add such
>> calculations, since it’ll look very-very narrow targeted, and I
>> worry that we might start to lose some cases as well. How about
>> ‘NumberOfConstantFoldedLoads’?
>>
>>
>> Sounds good to me.
>>
>>    if (TripCount && Count == TripCount) {
>> -    if (Threshold != NoThreshold && UnrolledSize > Threshold) {
>> +    if (Threshold != NoThreshold && UnrolledSize > Threshold + 20 * ElimInsns) {
>>
>>
>> 20, huh? Is this a heuristic for constant propagation? It feels like
>> we should be able to do better than this.
>> Yep, that’s a parameter of the heuristic. Counting each ‘constant'
>> load as 1 is too conservative and doesn’t give much here,
>>
>> Understood.
>>
>>
>>
>> but since
>> we don’t actually count number of eliminated instructions we need
>> some estimate for it. This estimate is really rough, since in some
>> cases we can eliminate the entire loop body, while in the others we
>> can’t eliminate anything.
>>
>> I think that we might as well estimate it. Once we know the loads
>> that unrolling will constant fold, put them into a SmallPtrSet S.
>> Then, use a worklist-based iteration of the uses of instructions in
>> S. For each use that has operands that are all either constants, or
>> in S, queue them and keep going. Make sure you walk breadth first
>> (not depth first) so that you capture things with multiple feeder
>> loads properly. As you do this, count the number of instructions
>> that will be constant folded (keeping a Visited set in the usual way
>> so you don't over-count). This will be an estimate, but should be a
>> very accurate one, and can be done without any heuristic parameters
>> (and you still only visit a linear number of instructions, so should
>> not be too expensive).
>> The biggest gain comes not from expressions like buf[0]*buf[3], which
>> can be constant folded once buf[0] and buf[3] are substituted with
>> constants, but from x*buf[0], when buf[0] turns out to be 0 or 1.
>> That is, we don’t constant-fold the expression, but we simplify it
>> based on the particular value. For example, the original convolution
>> test is in some sense equivalent to the following code:
>> const int a[] = {0, 1, 5, 1, 0};
>> int *b;
>> for (i = 0; i < 100; i++) {
>>   for (j = 0; j < 5; j++)
>>     b[i] += b[i+j]*a[j];
>> }
>>
>>
>> Complete unrolling will give us:
>>
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i]*a[0];
>>   b[i] += b[i+1]*a[1];
>>   b[i] += b[i+2]*a[2];
>>   b[i] += b[i+3]*a[3];
>>   b[i] += b[i+4]*a[4];
>> }
>>
>>
>> After const-prop we’ll get:
>>
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i]*0;
>>   b[i] += b[i+1]*1;
>>   b[i] += b[i+2]*5;
>>   b[i] += b[i+3]*1;
>>   b[i] += b[i+4]*0;
>> }
>> And after simplification:
>>
>> for (i = 0; i < 100; i++) {
>>   b[i] += b[i+1];
>>   b[i] += b[i+2]*5;
>>   b[i] += b[i+3];
>> }
>>
>>
>> As you can see, we actually folded nothing here, but rather we
>> simplified more than half of the instructions. But it’s hard to
>> predict which optimizations will be enabled by the exact values that
>> replace the loads
>
> Agreed.
>
>> - i.e., of course we could check this, but it would be too
>> optimization-specific in my opinion.
>
> I understand your point, but I think using one magic number is swinging the pendulum too far in the other direction.
>
>> Thus, I decided to use a number (20 in the current patch) that
>> represents the average profitability of replacing a load with a
>> constant. I think I should get rid of this magic number though, and
>> replace it with some target-specific parameter (that’ll help to
>> address Owen’s suggestions).
>
> Using a target-specific cost (or costs) is a good idea, but having target-specific magic numbers is going to be a mess. Different loops will unroll, or not, for fairly arbitrary reasons on different targets. I understand that this is a heuristic, and that you won't be able to exactly predict all of the later simplifications enabled by the unrolling. However, the fact that you currently need a large per-instruction boost factor like 20 suggests, I think, that the model is too coarse.
>
> Maybe it would not be unreasonable to start from the other side: If we first located loads that could be constant folded, and determined the values those loads would actually take, and then simplified the loop instructions based on those values, we'd get a pretty good estimate. This is essentially what the code in lib/Analysis/IPA/InlineCost.cpp does when computing the inlining cost savings, and I think we could use the same technique here. Admittedly, the inline cost analysis is relatively expensive, but I'd think we could impose a reasonable size cutoff to limit the overall expense for the case of full unrolling -- for this, maybe the boost factor of 20 is appropriate -- and would also address Owen's point, as far as I can tell, because like the inline cost analysis, we can use TTI to compute the target-specific cost of GEPs, etc. Ideally, we could refactor the ICA a little bit, and re-use the code in there directly for this purpose. This way we can limit the number of places in which we compute similar kinds of heuristic simplification costs.
Hi Hal,
Thanks for the comments. I think that’s a good idea, and I’ll try that out. If that doesn’t lead to over-complicated implementation, I’d like it much better than having a magic number. I’ll prepare a new patch soon.
Thanks,
Michael
>
> -Hal
>
>>
>>
>> Thanks,
>> Michael
>>
>>
>> -Hal
>>
>> I’d really appreciate a feedback on how to model this in a better
>> way.
>>
>>
>> Thanks,
>> Michael
>>
>> Thanks again,
>> Hal
>>
>> ----- Original Message -----
>>
>>
>> From: "Michael Zolotukhin" < mzolotukhin at apple.com >
>> To: "LLVM Developers Mailing List ( llvmdev at cs.uiuc.edu )" <
>> llvmdev at cs.uiuc.edu >
>> Cc: "Hal J. Finkel" < hfinkel at anl.gov >, "Arnold Schwaighofer" <
>> aschwaighofer at apple.com >
>> Sent: Friday, January 23, 2015 2:05:11 PM
>> Subject: [RFC] Heuristic for complete loop unrolling
>>
>>
>>
>> Hi devs,
>>
>> Recently I came across an interesting testcase that LLVM failed to
>> optimize well. The test does some image processing, and as a part of
>> it, it traverses all the pixels and computes some value based on
>> the adjacent pixels. So, the hot part looks like this:
>>
>> for (y = 0..height) {
>>   for (x = 0..width) {
>>     val = 0
>>     for (j = 0..5) {
>>       for (i = 0..5) {
>>         val += img[x+i,y+j] * weight[i,j]
>>       }
>>     }
>>   }
>> }
>>
>> And ‘weight' is just a constant matrix with some coefficients.
>>
>> If we unroll the two internal loops (with tripcount 5), then we can
>> replace weight[i,j] with concrete constant values. In this
>> particular case, many of the coefficients are actually 0 or 1, which
>> enables huge code simplifications later on. But currently we unroll
>> only the innermost one, because unrolling both of them will exceed
>> the threshold.
>>
>> When deciding whether to unroll or not, we currently look only at the
>> instruction count of the loop. My proposal is to, on top of that,
>> check if we can enable any later optimizations by unrolling - in
>> this case by replacing a load with a constant. Similar to what we do
>> in inlining heuristics, we can estimate how many instructions would
>> be potentially eliminated after unrolling and adjust our threshold
>> with this value.
>>
>> I can imagine that it might also be useful for computations
>> involving sparse constant matrices (like the identity matrix).
>>
>> The attached patch implements this feature, and with it we handle the
>> original testcase well.
>>
>>
>> Does it look good? Of course, any ideas, suggestions and other
>> feedback are welcome!
>>
>>
>> Thanks,
>> Michael
>>
>> --
>> Hal Finkel
>> Assistant Computational Scientist
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>