[PATCH] D25987: [X86] New pattern to generate PSUBUS from SELECT

Fri Oct 27 21:56:22 PDT 2017

yulia_koval added a comment.

In https://reviews.llvm.org/D25987#743482, @spatel wrote:

> In https://reviews.llvm.org/D25987#743378, @yulia_koval wrote:
>
> > Well, first 2 tests don't work with my patch because PSUBUS is a vector instruction. Test15 is working perfectly. Test15_max doesn't work, because I didn't try to recognize this pattern(the one you showed in a link), after optimization, because currently(for at least a year) llvm generates the code like in my tests, for example for psubus.c:
>
>
> It is irrelevant how long the current IR has been produced. It is also irrelevant what form the current IR takes until we agree that it is the canonical form. Let me try to explain.
>
> > void foo(unsigned short *p, int max, int n)
> >  {
> >  int i;
> >  unsigned m;
> >  for (i = 0; i < n; i++) {
> > 
> >   m = *--p;
> >   *p = (unsigned short)(m >= max ? m-max : 0);
> > 
> > }
> >  }
> > 
> > IR will be:
> > 
> >   %10 = zext <8 x i16> %reverse to <8 x i32>
> >   %11 = icmp ugt <8 x i32> %broadcast.splat16, %10
> >   %12 = sub <8 x i32> %10, %broadcast.splat16
> >   %13 = trunc <8 x i32> %12 to <8 x i16>
> >   %14 = select <8 x i1> %11, <8 x i16> zeroinitializer, <8 x i16> %13
>
> This looks good. I'm glad we can vectorize this. :)
>
> But do you agree that it is reasonable to write this function in C as:
>
>   void goo(unsigned short *p, int max, int n) {
>     int i;
>     unsigned m;
>     for (i = 0; i < n; i++) {
>       m = *--p;
>       unsigned umax = m > max ? m : max;
>       *p = (unsigned short)(umax - max);
>     }
>   }
>
>
> If yes, then I think you would also agree that LLVM should optimize that code in exactly the same way (to eventually produce a psubus when compiled for x86 -msse4.1). Now, if you agree with that, then the question is where should we recognize that these 2 functions are logically equivalent? I'm saying that it should occur in IR because that's the main job of IR: to translate source code into a canonical form that the backend can then optimize to the best machine-code for a given target.
>
> > I see, that optimization you provided in http://rise4fun.com/Alive/5e4 is valid, but it extends from i16 to i32 which may trigger some regressions, because it switches to wider operations. Maybe something like this will appear in the future, but currently the pattern above is generated. Also, whether the pattern generated from this particular C code will be changed or not, this patch will still be useful for this IR pattern, if it appears. I can add recognition of the test15_max pattern as well, should I do it?
>
> No. Until we decide what the canonical IR for this pattern looks like, you may be adding a useless transform (dead code) to the backend. "The future" where we no longer produce the IR pattern that this patch is looking for could be today, so let's not prematurely add a backend optimization until we know that pattern will actually exist. You're suggesting that we match multiple patterns in the backend to produce psubus. I'm suggesting that if we can canonicalize this pattern in IR, then we will more efficiently optimize it in the backend regardless of how it was written in source. Does this make sense?
>
> > Also if some optimization replace select(sub) with sub(select), won't it ruin any other psubus pattern as well?
>
> It is certainly possible. We should investigate the current patterns that match to psubus. As shown in your examples in this patch and my follow-up test15_max, the matching of psubus is not complete. Will we break the existing matching by canonicalizing the IR? If yes, we should add those transforms first to avoid regressions.
>
> So the steps to fix this properly:
>
> 1. Decide what is the canonical IR.
> 2. Add codegen tests for that IR (this should likely include more than x86).
> 3. Optimize those patterns in the backend (Do other targets have instructions equivalent to psubus? If yes, a generic DAGCombiner transform should be added.)
> 4. Add the canonicalization to IR.
>
>   This should ensure that no matter how the source is written, we'll get optimal codegen for all targets, and we'll do that efficiently.

In https://reviews.llvm.org/D25987#743482, @spatel wrote:

> In https://reviews.llvm.org/D25987#743378, @yulia_koval wrote:
>
> > Well, first 2 tests don't work with my patch because PSUBUS is a vector instruction. Test15 is working perfectly. Test15_max doesn't work, because I didn't try to recognize this pattern(the one you showed in a link), after optimization, because currently(for at least a year) llvm generates the code like in my tests, for example for psubus.c:
>
>
> It is irrelevant how long the current IR has been produced. It is also irrelevant what form the current IR takes until we agree that it is the canonical form. Let me try to explain.
>
> > void foo(unsigned short *p, int max, int n)
> >  {
> >  int i;
> >  unsigned m;
> >  for (i = 0; i < n; i++) {
> > 
> >   m = *--p;
> >   *p = (unsigned short)(m >= max ? m-max : 0);
> > 
> > }
> >  }
> > 
> > IR will be:
> > 
> >   %10 = zext <8 x i16> %reverse to <8 x i32>
> >   %11 = icmp ugt <8 x i32> %broadcast.splat16, %10
> >   %12 = sub <8 x i32> %10, %broadcast.splat16
> >   %13 = trunc <8 x i32> %12 to <8 x i16>
> >   %14 = select <8 x i1> %11, <8 x i16> zeroinitializer, <8 x i16> %13
>
> This looks good. I'm glad we can vectorize this. :)
>
> But do you agree that it is reasonable to write this function in C as:
>
>   void goo(unsigned short *p, int max, int n) {
>     int i;
>     unsigned m;
>     for (i = 0; i < n; i++) {
>       m = *--p;
>       unsigned umax = m > max ? m : max;
>       *p = (unsigned short)(umax - max);
>     }
>   }
>
>
> If yes, then I think you would also agree that LLVM should optimize that code in exactly the same way (to eventually produce a psubus when compiled for x86 -msse4.1). Now, if you agree with that, then the question is where should we recognize that these 2 functions are logically equivalent? I'm saying that it should occur in IR because that's the main job of IR: to translate source code into a canonical form that the backend can then optimize to the best machine-code for a given target.
>
> > I see, that optimization you provided in http://rise4fun.com/Alive/5e4 is valid, but it extends from i16 to i32 which may trigger some regressions, because it switches to wider operations. Maybe something like this will appear in the future, but currently the pattern above is generated. Also, whether the pattern generated from this particular C code will be changed or not, this patch will still be useful for this IR pattern, if it appears. I can add recognition of the test15_max pattern as well, should I do it?
>
> No. Until we decide what the canonical IR for this pattern looks like, you may be adding a useless transform (dead code) to the backend. "The future" where we no longer produce the IR pattern that this patch is looking for could be today, so let's not prematurely add a backend optimization until we know that pattern will actually exist. You're suggesting that we match multiple patterns in the backend to produce psubus. I'm suggesting that if we can canonicalize this pattern in IR, then we will more efficiently optimize it in the backend regardless of how it was written in source. Does this make sense?
>
> > Also if some optimization replace select(sub) with sub(select), won't it ruin any other psubus pattern as well?
>
> It is certainly possible. We should investigate the current patterns that match to psubus. As shown in your examples in this patch and my follow-up test15_max, the matching of psubus is not complete. Will we break the existing matching by canonicalizing the IR? If yes, we should add those transforms first to avoid regressions.
>
> So the steps to fix this properly:
>
> 1. Decide what is the canonical IR.
> 2. Add codegen tests for that IR (this should likely include more than x86).
> 3. Optimize those patterns in the backend (Do other targets have instructions equivalent to psubus? If yes, a generic DAGCombiner transform should be added.)
> 4. Add the canonicalization to IR.
>
>   This should ensure that no matter how the source is written, we'll get optimal codegen for all targets, and we'll do that efficiently.

In https://reviews.llvm.org/D25987#908551, @RKSimon wrote:

> @yulia_koval What is the state of this after https://reviews.llvm.org/D37510 and https://reviews.llvm.org/D37534?

The one additional patch is required to apply the changes - cannonicalization part. I will submit it soon.

https://reviews.llvm.org/D25987