[PATCH] D22792: VecClone Pass

Thu Dec 14 20:45:53 PST 2017

hfinkel added a comment.

In general, I think the VecClone pass is too complicated because it tries to handle the "optimized code" vs. "non-optimized code" cases separately. I don't think we should (or, in a theoretical sense, can) do that. We should have a uniform algorithm to handle all incoming IR. I think that we can do something like this:

1. Split the entry block at the top and move all allocas in the original entry block to the new entry block.
2. For each constant-sized alloca in the entry block, expand it by a factor of VL. For vectorizable types, you can do an alloca of the vector type. Otherwise, use an alloca to generate an array of VL items (i.e., use VL as the alloca's second parameter). Generate an alloca of <RetTy x VL> to hold the return value.
3. Generate a new loop around all of the rest of the function (for i = [0, VL-1]) and a new return block (which loads the value from the return-value alloca and returns it).
4. Replace all uses of each entry-block alloca, a, with &a[i]. Remove the old allocas.
5. All uses of function parameters are now inside the loop. vector parameters will get a vector alloca and store in the entry block, and uses in the loop of the parameter will be replaced by load param[i]. linear parameters get replaced by (p+i). uniform parameters are unchanged.
6. Replace the returns with a store to RetVal[i] and a branch to the loop-exit block.

Something like that should work for all functions, optimized or not.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:633
+  /// \brief Targets defined in the vector function ABI.
+  enum TargetProcessor {
+    Pentium4,      // ISA extension = SSE2,     ISA class = XMM
----------------
This is X86 specific, and doesn't seem to belong here (also, it's not used anywhere).

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:648
+  /// ISA classes defined in the vector function ABI.
+  enum ISAClass {
+    XMM,  // (SSE2)
----------------
This shouldn't be an enum like this. We don't need X86-specific things in the TTI interface. It looks like you've done a good job at making this opaque, so you can move this enum itself into the X86TTI implementation. The only thing we need here is the definition of UnknownISA (maybe define that to be an integer = -1).

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:653
+    ZMM,  // (MIC)
+    UnknownIsa,
+    ISAClassesNum
----------------
UnknownISA

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:917
+  /// \returns The maximum vector register width for \p IsaClass.
+  unsigned ISAClassMaxRegisterWidth(ISAClass Class) const;
+
----------------
This shouldn't start with a capital letter. Either name it `isaClassMaxRegisterWidth` or `getISAClassMaxRegisterWidth`. I prefer the latter.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:920
+  /// \returns The ISA class as a string.
+  std::string ISAClassToString(ISAClass Class) const;
+
----------------
This shouldn't start with a capital letter. Either name it `isaClassToString` or `convertISAClassToString`. I prefer the latter.

================
Comment at: include/llvm/Analysis/VectorVariant.h:34
+
+#define POSITIVE 1
+#define NEGATIVE -1
----------------
These preprocessor should just be constexpr/static const ints.

================
Comment at: include/llvm/Transforms/Utils/VecClone.h:26
+enum InstType {
+  ALLOCA = 0,
+  STORE,
----------------
Don't use all caps for these names. IT_Alloca or Alloca could work.

================
Comment at: include/llvm/Transforms/Utils/VecClone.h:72
+
+    /// \brief Make a copy of the function if it is marked as SIMD.
+    Function* CloneFunction(Function &F, VectorVariant &V);
----------------
Please use consistent terminology: marked as SIMD -> requires a vector variant.

================
Comment at: lib/Transforms/IPO/PassManagerBuilder.cpp:35
 #include "llvm/Transforms/IPO.h"
 #include "llvm/Transforms/IPO/ForceFunctionAttrs.h"
 #include "llvm/Transforms/IPO/FunctionAttrs.h"
----------------
You also need to add the pass to the new pass manager: lib/Passes/PassBuilder.cpp

================
Comment at: lib/Transforms/IPO/PassManagerBuilder.cpp:98

+static cl::opt<bool> RunVecClone("enable-vec-clone", cl::init(false), cl::Hidden,
+                                 cl::desc("Run Vector Function Cloning"));
----------------
I wouldn't bother with this. It should be safe if no functions with the vector-variants attribute are present (and, if they are, then you need to run the pass of things might not link).

================
Comment at: lib/Transforms/Utils/VecClone.cpp:86
+///   }
+///   return vec_ret;
+/// }
----------------
What cleans up the extra loads/stores here (if we're optimizing)? Are assuming that there's a run of SROA afterwards, or does InstCombine do a good job here, or something else?

================
Comment at: lib/Transforms/Utils/VecClone.cpp:162
+
+  // FuncVars will contain a 1-many mapping between the original scalar
+  // function and the vector variant encoding strings (represented as
----------------
1-many -> one-to-many

================
Comment at: lib/Transforms/Utils/VecClone.cpp:289
+
+  std::string VariantName = V.generateFunctionName(F.getName());
+  Function *Clone = Function::Create(
----------------
Move this near the top of the function and check, before you generate code, that he function doesn't already exist. If it does, bail out (e.g., return a nullptr and the caller can move on to the next variant/function).

================
Comment at: lib/Transforms/Utils/VecClone.cpp:345
+
+    for (; ArgListIt != ArgListEnd; ++ArgListIt) {
+      unsigned ParmIdx = ArgListIt->getArgNo();
----------------
I don't think you need the iterators; this can just be:

  for (auto &Arg : Clone->args())

and you use this pattern a lot (declaring two iterators and then using them in a for loop). In almost all of these cases, you should use a range-based for loop instead.

================
Comment at: lib/Transforms/Utils/VecClone.cpp:418
+
+BasicBlock *VecClone::splitLoopIntoReturn(Function *Clone,
+                                          BasicBlock *LoopBlock) {
----------------
What happens if there's more than one return in the function? You might want, in that case, to create a new block with the return and convert all other returns to branches to that block.

================
Comment at: lib/Transforms/Utils/VecClone.cpp:483
+
+      BranchInst *Branch = dyn_cast<BranchInst>(BBI);
+
----------------
In addition to branches, you need to handle SwitchInst and IndirectBrInst (and, for the latter, you need to find any place where the address of the return block is taken, and replace it with the address of the LoopExitBlock).

================
Comment at: lib/Transforms/Utils/VecClone.cpp:534
+  Instruction *Induction =
+      BinaryOperator::CreateNUWAdd(Phi, Inc, "indvar", LoopExitBlock);
+
----------------
Use CreateAdd here so you can set both nuw and nsw on this increment.

================
Comment at: lib/Transforms/Utils/VecClone.cpp:638
+            PRef->VectorParm = Alloca;
+            break;
+          }
----------------
What happens if there is more than one store user?

https://reviews.llvm.org/D22792