[PATCH] D55604: [AggressiveInstCombine] convert rotate with guard branch into funnel shift (PR34924)
Nikita Popov via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Dec 12 14:33:16 PST 2018
nikic added inline comments.
================
Comment at: lib/Transforms/AggressiveInstCombine/AggressiveInstCombine.cpp:95
+ return Intrinsic::not_intrinsic;
+ };
+
----------------
spatel wrote:
> fabiang wrote:
> > nikic wrote:
> > > spatel wrote:
> > > > nikic wrote:
> > > > > Rather than matching the rotate pattern here, would it be possible to leave the conversion to fsh to InstCombine (as you say, this is planned anyway), and only do the removal of the zero check here? Or is general canonicalization in InstCombine still blocked on some optimization issues?
> > > > There's no clear limit on the power of aggressive-instcombine, but I'm assuming that we should follow the same rules as regular instcombine. Ie, you should be able to transfer anything in here to instcombine if you don't care about compile-time.
> > > >
> > > > So we defer CFG changes to other passes (simplifycfg in particular) and just focus on logical transforms. This one blurs the lines of course, but I don't think we should be trying to alter the branch rather than the phi here.
> > > >
> > > > The only remaining optimization issue that I'm aware of for funnel shift canonicalization is addressed by D55563.
> > > I'm not suggesting to alter the branches here, but rather to split this transform into a two step process. First InstCombine converts the `(x << y) | (x >> (bw - y))` into `fshl(x, x, y)` (which is legal independently of the null check for rotates, but not general funnel shifts) and then here we only handle the `y == 0 ? x : fshl(x, x, y)` pattern, rather than the shift-based variant.
> > >
> > > However, after reading https://bugs.llvm.org/show_bug.cgi?id=34924, maybe this would not even be necessary? In one of the comments at the end you mention that if the rotate is represented as fsh, then simplifycfg already converts this pattern into a select. Together with the fsh+select simplification you already added in instsimplify, wouldn't that mean that a combination of instcombine+simplifycfg+instsimplify will already take care of the whole pattern (if we canonicalize the shl/ashr/or sequence to fsh in instcombine)?
> > >
> > > So the process would go like this:
> > >
> > > ```
> > > define i32 @rotl(i32 %a, i32 %b) {
> > > entry:
> > > %cmp = icmp eq i32 %b, 0
> > > br i1 %cmp, label %end, label %rotbb
> > >
> > > rotbb:
> > > %sub = sub i32 32, %b
> > > %shr = lshr i32 %a, %sub
> > > %shl = shl i32 %a, %b
> > > %or = or i32 %shr, %shl
> > > br label %end
> > >
> > > end:
> > > %cond = phi i32 [ %or, %rotbb ], [ %a, %entry ]
> > > ret i32 %cond
> > > }
> > > // -instcombine
> > > define i32 @rotl(i32 %a, i32 %b) {
> > > entry:
> > > %cmp = icmp eq i32 %b, 0
> > > br i1 %cmp, label %end, label %rotbb
> > >
> > > rotbb:
> > > %or = call i32 @llvm.fshl(i32 %a, i32 %a, i32 %b)
> > > br label %end
> > >
> > > end:
> > > %cond = phi i32 [ %or, %rotbb ], [ %a, %entry ]
> > > ret i32 %cond
> > > }
> > > // -simplifycfg
> > > define i32 @rotl(i32 %a, i32 %b) {
> > > entry:
> > > %cmp = icmp eq i32 %b, 0
> > > %or = call i32 @llvm.fshl(i32 %a, i32 %a, i32 %b)
> > > %cond = select i1 %cmp, i32 %or, %a
> > > ret i32 %cond
> > > }
> > > // -instsimplify
> > > define i32 @rotl(i32 %a, i32 %b) {
> > > entry:
> > > %or = call i32 @llvm.fshl(i32 %a, i32 %a, i32 %b)
> > > ret i32 %or
> > > }
> > > ```
> > I agree that it feels a bit redundant to have the pattern replicated twice, once for branches here and once for selects in instsimplify (in what was D54552).
> >
> > If the contents of rotbb get reliably converted into funnel shifts by instcombine, and simplifycfg turns it into a select, the existing code seems sufficient.
> >
> > A counterargument may be that the more passes need to collaborate to handle this pattern as intended, the bigger the chances are of something going wrong in the middle of the process.
> Ok - this wasn't clear in the bug comments, so let me try to explain it better here. We can't turn this alone:
> %sub = sub i32 32, %b
> %shr = lshr i32 %a, %sub
> %shl = shl i32 %a, %b
> %or = or i32 %shr, %shl
>
> ...into fshl/fshr. The reason is that -- for a target that does not have a rotate instruction -- this will get expanded into a sequence with more ops and almost certainly worse perf. And there's no way for the backend to reverse that, so that's forbidden AFAIK. If the code already has some attempt to guard against UB (by masking/selecting/branching), then we can turn it into funnel shift because we can (almost) guarantee that our worst-case expansion is still equal or better than the original code.
Okay, that makes sense. For targets with rotate it's a trivial win, but for those without we can only perform the transform here because we're trading off a branch for two extra ands.
In that case, I'm wondering if this transform should be restricted to power-of-two bitwidths. Otherwise this might end up introducing urems rather than simple masking (and looking at the current DAG code, we'd actually even add a select on zero, even though that doesn't seem necessary for rotates). Not that it really matters in the end, as I'm having a hard time imagining how one would end up with a non-power-of-two rotate.
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D55604/new/
https://reviews.llvm.org/D55604
More information about the llvm-commits
mailing list