[PATCH] D109676: [HardwareLoops] put +1 for loop count before zero extension

Wed Oct 20 04:46:27 PDT 2021

shchenz added inline comments.

================
Comment at: llvm/lib/Analysis/TargetTransformInfo.cpp:188
+  // as not overflow. Hareware count register will handle the count 0 case well.
+  // For example, on PowerPC, if the value in the Count Register is 0 before
+  // being decremented, it is -1 afterward.
----------------
samparker wrote:
> shchenz wrote:
> > samparker wrote:
> > > shchenz wrote:
> > > > samparker wrote:
> > > > > This comment is making me uneasy... I'm don't think target specific stuff should be referenced in this generic layer. But either way, why would a count register ever have a value of -1, especially when we're expending quite some effort to test against zero in this transform?
> > > > PowerPC don't have test and set form. On PowerPC, if the count register is 0 before being decremented, it will be wrapped to -1(all bits are 1) afterward, so we can handle the overflow case when `ExitCount` type is same with `CountType`. The total loop count is u64_MAX + 1
> > > > 
> > > > But on other targets like ARM, the hardware loops pass handles the overflow case for `ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));` when `ExitCount` type is same with `CountType` with test and set form? If this is the case, yes, the comments is not right, we need to update it for different targets.
> > > > 
> > > > But for all targets, making ` SE.getAddExpr(ExitCount, SE.getOne(CountType));` as not overflow when `SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth()` should be OK. Right?
> > > Regardless of test.set support, all the intrinsics which set the counter use the trip count. So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example?
> > > 
> > > The test.set forms, for ARM, are just so we can try to map on to 'while' loops and we can't presume anything in this method about test.set being successfully generated.
> > > 
> > > > But for all targets, making  SE.getAddExpr(ExitCount, SE.getOne(CountType)); as not overflow when SE.getTypeSizeInBits(ExitCount->getType()) == CountType->getBitWidth() should be OK. Right?
> > > Yes, this sounds fine and I think this would be the only condition needed for the overflow check.
> > Hi, let me describe the issue a little bit. Sorry for the long comment and appreciate your patience. @samparker 
> > 
> > 1. Without any change:
> > ```
> > if (!ExitCount->getType()->isPointerTy() &&
> >     ExitCount->getType() != CountType)
> >   ExitCount = SE.getZeroExtendExpr(ExitCount, CountType);
> > 
> > ExitCount = SE.getAddExpr(ExitCount, SE.getOne(CountType));
> > ```
> > Suppose `CountType` is `i64`
> > 
> > For ExitCount type smaller than `i64`, it will be zero-extended to i64, so there will be no overflow issue;
> > For ExitCount type same with `i64`, it will add one directly without checking overflow. This should be functionality right since no issue was found so far.  Technically speaking, `SE.getAddExpr(ExitCount, SE.getOne(CountType));` could be overflow if `ExitCount` is u64_max before adding one. This case can work right on PowerPC, because when TripCount(after adding one and overflow) is zero, the loop execution count is u64_max + 1 according to PowerPC count register semantic. So I guess other targets like ARM should have the same hardware behavior or there is some other place to handle the overflow zero case?
> > 
> > 2. With the change in D91724:
> > ```
> >     TripCount = SE.getAddExpr(EC, SE.getOne(EC->getType()));
> > 
> >     if (!EC->getType()->isPointerTy() && EC->getType() != CountType)
> >       TripCount = SE.getZeroExtendExpr(TripCount, CountType);
> > ```
> > 
> > This is not right as we found regression pr51714 and the root cause was identified by Philip:
> > For narrower type, like i32, if `ExitCount` is u32_max before adding one, before the change(zero extend first and the add one), the loop count is  u32_max + 1 (represented in i64), but after the change, the loop count is 0(overflow on type i32).
> > patch D91724 is not right.
> > 
> > 3. Now with this patch D109676:
> > 
> > For `ExitCount` type smaller than `i64`, if we can prove that `ExitCount + 1` will not overflow, we can put +1 before zero-extension. Otherwise, we have to first do zero-extension and then do +1;
> > For `ExitCount` type same with `i64`, according to the legacy logic,  we can treat it as not-overflow even though it can be overflow technically. But like the above comments for `Without any change`, all targets can handle the overflow with some mechanism.
> > 
> > For your question:
> > > So, if this is zero, as you're suggesting, the loop should never execute and so your counter should never be decremented to -1. What am I missing here? Could you point me at a test for an example?
> > 
> > I think the TripCount is zero if `ExitCount` is u64_max before adding one, so hardware loop pass will set hardware count register to 0 as the result of overflow on type i64, and on PowerPC this is a valid loop, the loop execution count is u64_max + 1. No test case in hand, I did some experiments on PowerPC, set loop count register to 0 and after the first decrement (`bdnz` instruction), the loop count register is changed to u64_max(0xffffffffffffffff, -1).
> > 
> > >  I think this would be the only condition needed for the overflow check.
> > 
> > Do you mean we don't need the `isLoopEntryGuardedByCond` for the overflow check? I think `isLoopEntryGuardedByCond` is still needed for types smaller than i64, otherwise, all smaller types are treated as `WillNotOverflow` and we will not zero extend the type in `getTripCountFromExitCount`, we may still meet the same issue with what we met in patch `D91724`?
> Sigh, sorry, I forgot (again) that we haven't checked against ExitCount == UMAX . There's the static function 'mustBeFiniteCountedLoop' in PlaceSafePoints, maybe this could be moved somewhere so we could use it? Or has that kind of check already been tried? Sorry for my confusion and continued questioning.
Haha, don't worry about your continued questions. They are making this patch be more reasonable and clear.

Here, we wouldn't get infinite loop error at least on PowerPC target. When the initial loop count register is 0, the loop execution count is u64_max + 1. 

A typical pattern for a hardware loop on PowerPC is like this:
```
mtctr $gpr  ; move the loop count from $gpr to loop count regsiter CTR
loop_start:  ; start of loop header 
...   ; loop body
...
bdnz loop_start  ; loop latch
```
The semantic of the PowerPC `bdnz` instructions is "Decrement CTR and branch if it is still nonzero". So if the initial value is 0, after the first decrement, the loop count register is changed to `0xffffffffffffffff`. and then after decrement 0xffffffffffffffff times, it will be changed to zero and then the loop is exiting.

Do we meet infinite loop issue on ARM? I think here we only care about overflow.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D109676/new/

https://reviews.llvm.org/D109676