[PATCH] Generation of PSAD in LoopVectorizer

Sun Apr 5 22:16:05 PDT 2015

Hello James,

Thank you for your reply. My comments are below.

- Vijender

-----Original Message-----
From: James Molloy [mailto:james.molloy at arm.com] 
Sent: Wednesday, April 01, 2015 6:14 PM
To: Muthyala, Vj; aschwaighofer at apple.com; spatel at rotateright.com; hfinkel at anl.gov
Cc: james.molloy at arm.com; ahmed.bougacha at gmail.com; llvm-commits at cs.uiuc.edu
Subject: Re: [PATCH] Generation of PSAD in LoopVectorizer

Hi Vijender,

Thanks for doing this! I have a bunch of comments below. High-level, why are you lowering this as an intrinsic rather than an IR-level pattern? I ask because we really do want this pattern recognized in straight-line (non vectorized) code too, and that is something the DAGCombiner could do.

[VJ] I got your concern and I do accept your point. But if I handle this in DAGCombiner there will be too many patterns to handle because there will be multiple paths reaching the DAGCombiner. For example if LV is enabled and VF is selected as 4, then the pattern contains instructions expanded to V4. Similar is the case with VF =8. If Vectorizer is disabled then the pattern will be scalar pattern. So basically we want to solve it in divide and conquer approach. Right now we have handled in LV and a serious effort is done by Shahid to handle it in SLP (because there will be cases where LV cannot handle due to loop unrolling and other reasons). Now the only remaining part will be non vectorize path which can be handled in DAGCombiner as you mentioned which will be the future work.

[VJ] The reason for replacing the pattern with an intrinsic is that we don’t want that pattern to be disturbed by other optimizations. 


Cheers,

James


REPOSITORY
  rL LLVM

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:455
@@ -454,1 +454,3 @@
 
+  /// \returns the cost of SAD instruction.
+  unsigned getSADInstrCost(Type *RetTy, Type *op1,
----------------
Please spell out "SAD" = "Sum of absolute differences" = "C += abs(A - B)" in the docstring, so it is clear (not all backends will use the same mnemonic).

================
Comment at: include/llvm/CodeGen/ISDOpcodes.h:647
@@ +646,3 @@
+    /// SAD - This corresponds to a Sum of Absolute Difference(SAD) instruction.
+    /// The operands are vectors of char type with length of vector being power
+    /// of 2 and the result is a scalar integer.
----------------
"of char type"? Do you mean i8? Why only i8? Why is the (scalar part of the) input type not the same as the output type?
[VJ] - The Sum of absolute difference instruction is only appropriate for char Types. It is a byte level reduction operation. If you have a look at PSAD in X86 or USAD in ARM, all happen at byte level.
[VJ] - The scalar type of output is not same as scalar type of input because sum of eight  i8's may not fit properly in an i8 without data lose.

================
Comment at: include/llvm/IR/Intrinsics.td:596 @@ +595,3 @@
+// Calculate the Sum of Absolute Differences (SAD) of the two input vectors.
+// Only vectors of char Type are allowed.
+def int_sad  : Intrinsic<[llvm_i32_ty],
----------------
Char? Why i32? why not i64? why not i8/i16?
[VJ] - I accept your point. We can make it to return any type.

================
Comment at: include/llvm/IR/Intrinsics.td:597 @@ +596,3 @@
+// Only vectors of char Type are allowed.
+def int_sad  : Intrinsic<[llvm_i32_ty],
+                         [llvm_anyvector_ty, llvm_anyvector_ty],
----------------
What is the signedness of this operation? Is there a floating point equivalent? If not, why not?
[VJ] - There is no floating point equivalent of SAD.
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:630
@@ +629,3 @@
+    SpecialPatternDescriptor(SpecialPatternKind SPKind,
+                             SpecialActionKind SAK, unsigned miVF,
+                             unsigned maVF)
----------------
Just use "minVF" here, it won't conflict as long as you reference it in the initializer list.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:642
@@ +641,3 @@
+    // this pattern. For example, for x86 the minimum VF for SAD is 8.
+    unsigned minVF;
+    // maxVF - Specifies the maximum VF for which the target supports
----------------
I don't like the assumption here that the legality of an operation is a range [min,max]. I don't think we should assume continuity in the range. Perhaps this should be a bitset of supported types instead?
[VJ] - Can you please explain how to solve this issue. Are you telling to have  supported types of the instruction in a list and check that instead of checking the VF factor?
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:767
@@ +766,3 @@
+  /// which is part of special pattern in a basic block.
+  typedef DenseMap<Instruction *, SpecialPatternDescriptor> ActionList;
+
----------------
Shouldn't this be a multimap? I can imagine that there might be complex patterns and simple patterns, and a complex pattern may subsume a simple pattern. So an instruction could be part of two patterns (until one is selected).
[VJ] - Ok. How about checking the instruction whether it belongs to any other pattern before inserting in the ActionList. Because I think it will not make sense to keep that instruction in both the patterns. Rather we can select which pattern to associate that instruction before inserting. 
            What do you think? 

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1014
@@ -942,1 +1013,3 @@
+  /// If there is no special pattern then the widest type is returned. 
+  unsigned selectBestTypeForPattern();
 
----------------
Uh, surely this needs to be just the type of the phi? If smaller types are legal, the phi should have a smaller type, right?
[VJ] Actually this value is required to select the maximum VF factor to try. So lower the value the higher the vectorization factors I can try.

Your proposed algorithm falls down in this case:

    int32_t a;
    int8_t b;
    int32_t *c;
    for (...) {
      a += abs(c[i] - c[i+1]);
      b += (int8_t)c[i];
    }

Oh dear, you'll select "i8" as your type but that is illegal (it's not part of this PHI).


================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:1050
@@ +1049,3 @@
+  /// for a given vector width.
+  unsigned SpecialPhiCost(PHINode *p, unsigned VF);
+
----------------
I generally don't like the use of "Special". Can we not think of a more descriptive name for it, like "Pattern"?
[VJ] - This "Special" word was suggested in the previous review comments in RFC.

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2556
@@ -2478,2 +2555,3 @@
   case RK_IntegerOr:
+  case RK_IntegerSpecial:
     // Adding, Xoring, Oring zero to a number does not change it.
----------------
Well this isn't right - Who says the identity for all pattern based reductions is integer zero?
[VJ] - I got your point. I need to send the Phi node as one of the arguments to this function to know more about the type and return the respective identity. Will work on it.
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:2582
@@ -2503,1 +2581,3 @@
       return Instruction::Add;
+    case LoopVectorizationLegality::RK_IntegerSpecial:
+      return Instruction::Add;
----------------
If it assumed that all patterns can be added together, that isn't documented anywhere. 
[VJ] - I got your point. I need to send the Phi node as one of the arguments to this function to know more about the type and return the respective instruction. Will work on it.
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3014
@@ -2876,3 +3013,3 @@
 
-    if (VF > 1) {
+    if (VF > 1 && !isSAD) {
       // VF is a power of 2 so we can emit the reduction using log2(VF) shuffles
----------------
I don't like this at all. You've introduced a new generic pattern type then binned that and special-cased for SAD. Why is SAD special? why is it different from ADD for this case?
[VJ] the code inside the IF block reduces the vector to a single scalar value. Here my SAD intrinsic output is already a scalar. I need to skip this merger block for SAD.
================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3295
@@ -3157,2 +3294,3 @@
   // For each instruction in the old loop.
+  SmallVector<Value *, 4> args;
   for (BasicBlock::iterator it = BB->begin(), e = BB->end(); it != e; ++it) {
----------------
Args

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:3312
@@ +3311,3 @@
+          assert(0 && "Cannot be Invalid here!!");
+          break;
+        }
----------------
No default case?

================
Comment at: lib/Transforms/Vectorize/LoopVectorize.cpp:4862
@@ +4861,3 @@
+          switch(SPD.Kind) {
+            case LoopVectorizationLegality::SPK_Sad:
+              specialPatternWidth = 
+ std::min(specialPatternWidth,(unsigned)8);
----------------
I think here you're special casing the SpecialPattern stuff when really it shouldn't be special-cased - this is a problem (finding the smallest valid type for an operation) that applies in many cases to most operators, not just your patterns.
[VJ] - So shall I replace the getWidestType in the code with the getSmallestType so that it can try all the vectorization factors? Anyways cost modeling already handles everything properly so I feel there is no harm in  using getSmallestType.

I think it gives the biggest impact in ARM/MIPS world though as i8 and i16 aren't legal scalar types for us.

http://reviews.llvm.org/D8136

EMAIL PREFERENCES
  http://reviews.llvm.org/settings/panel/emailpreferences/