[PATCH] D36059: [memops] Add a new pass to inject fast-path code for specific library function calls.

David Li via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 3 21:04:02 PDT 2017


davidxl added a comment.

Does it penalize borderline cases? For instance, a 7-byte memcpy requires a byte-copy loop of 7 iterations, which is above the threshold; in that case the runtime check purely adds overhead. This is unlike more general partial inlining of string-op operations, where we may have internal APIs to bypass similar checks inside the library function.
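To make the borderline case concrete, here is a hand-written sketch of the structure the pass injects (not its actual output; the threshold constant and names are illustrative). A 7-byte copy always fails the guard, so the extra compare and branch is pure overhead before the library call:

  #include <cstddef>
  #include <cstring>

  // Illustrative threshold standing in for the TTI-derived iteration limit.
  static constexpr std::size_t kFastPathThreshold = 6;

  void *fastPathMemcpy(void *Dst, const void *Src, std::size_t Len) {
    if (Len <= kFastPathThreshold) {
      // Fast path: a byte-copy loop the backend can fully unroll.
      auto *D = static_cast<unsigned char *>(Dst);
      auto *S = static_cast<const unsigned char *>(Src);
      for (std::size_t I = 0; I != Len; ++I)
        D[I] = S[I];
      return Dst;
    }
    // Slow path: a borderline length like 7 pays for the check above and
    // still ends up in the library call.
    return std::memcpy(Dst, Src, Len);
  }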



================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:799
+    /// This must always be a power of two.
+    int OpByteSize;
+
----------------
nit: I find OpByteSize not intuitive. Perhaps DataByteWidth?


================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2235
+  // On x86, a loop of up to six iterations remains profitable compared to the
+  // overhead of a library call unless more than 16 bytes are being touched at
+  // which point the vector based code in the library call is advantageous even
----------------
Reference? The overhead here is vague. Does it include the PLT or not? Or does it mean the setup cost of rep movs/stos?




================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:102
+    int Depth = 0;
+    for (DomTreeNode *N = DT.getNode(I.getParent()); N && Depth < 10;
+         N = N->getIDom()) {
----------------
Why the magic number 10? In practice, I would think 2 or 3 is enough. Also, add an internal option for this?
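If an internal option is added, something along these lines would do (the option name and default are illustrative only):

  #include "llvm/Support/CommandLine.h"

  using namespace llvm;

  // Hypothetical internal knob for the dominator-walk depth.
  static cl::opt<unsigned> FastPathDomWalkDepth(
      "fast-path-dom-walk-depth", cl::Hidden, cl::init(3),
      cl::desc("Maximum number of dominating nodes to inspect when looking "
               "for an existing zero-length check"));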


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:111
+      if (!Cmp || Cmp->getOperand(0) != V ||
+          Cmp->getOperand(1) != ConstantInt::get(V->getType(), 0))
+        continue;
----------------
Why limit to a comparison with zero? More generally, why not collect more general predicate info so that the second size check can also be eliminated, or skip the fast path if the size is known to be large?
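For illustration only (the helper, the enum, and the threshold plumbing below are hypothetical, not part of the patch), a dominating bound on the length could be classified instead of only recognizing equality with zero; a known-small bound makes the emitted size check redundant, and a known-large bound argues for skipping the fast path:

  #include "llvm/ADT/APInt.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/PatternMatch.h"
  #include <cstdint>

  using namespace llvm;

  enum class SizeInfo { Unknown, KnownSmall, KnownLarge };

  // Assumes the caller has already established that the true edge of the
  // branch on Cmp dominates the mem op.
  SizeInfo classifyDominatingLengthCheck(ICmpInst *Cmp, Value *Len,
                                         uint64_t Threshold) {
    using namespace PatternMatch;
    const APInt *Bound;
    if (!Cmp || Cmp->getOperand(0) != Len ||
        !match(Cmp->getOperand(1), m_APInt(Bound)))
      return SizeInfo::Unknown;
    switch (Cmp->getPredicate()) {
    case ICmpInst::ICMP_ULT: // Len < Bound holds at the call site.
      return Bound->ule(Threshold) ? SizeInfo::KnownSmall : SizeInfo::Unknown;
    case ICmpInst::ICMP_UGT: // Len > Bound holds at the call site.
      return Bound->uge(Threshold) ? SizeInfo::KnownLarge : SizeInfo::Unknown;
    default:
      return SizeInfo::Unknown;
    }
  }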


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:123
+      case ICmpInst::ICMP_SGT:
+      case ICmpInst::ICMP_SLT:
+        // Predicates where a match precludes equality with zero.
----------------
< 0 means a huge unsigned length. In that case, should you skip the fast path completely?


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:179
+    using namespace PatternMatch;
+    if (match(I.getLength(),
+              m_NUWShl(m_Value(Count), m_ConstantInt(ShiftScaleC)))) {
----------------
Is this pattern common?
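For what it's worth, it is the shape you get from copying N elements of a power-of-two-sized type; whether the shift actually carries the `nuw` flag depends on the optimizer proving the multiply cannot wrap (the example below is only a sketch of that situation):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // N * sizeof(uint64_t) is canonicalized to `shl i64 %N, 3`, and the pass's
  // pattern fires once the shift is marked `nuw`.
  void copyWords(uint64_t *Dst, const uint64_t *Src, std::size_t N) {
    std::memcpy(Dst, Src, N * sizeof(uint64_t));
  }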


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:191
+    int Alignment = std::max<int>(1, I.getAlignment());
+    int MaxOpByteSize = std::min<int>(1 << ShiftScale, Alignment);
+
----------------
How about on targets where unaligned access is OK?
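One possible refinement, sketched below (the helper itself, and whether `allowsMisalignedMemoryAccesses` is the right signal here, are my own assumptions rather than part of the patch): only clamp to the pointer alignment when the target says misaligned accesses of that width are slow:

  #include "llvm/Analysis/TargetTransformInfo.h"
  #include "llvm/IR/LLVMContext.h"
  #include <algorithm>

  using namespace llvm;

  int chooseOpByteSize(const TargetTransformInfo &TTI, LLVMContext &Ctx,
                       int ShiftScale, int Alignment) {
    int WideSize = 1 << ShiftScale;
    bool Fast = false;
    // If the target handles misaligned accesses of this width cheaply, keep
    // the wide op size even when the pointer alignment is smaller.
    if (TTI.allowsMisalignedMemoryAccesses(Ctx, WideSize * 8,
                                           /*AddressSpace=*/0,
                                           /*Alignment=*/1, &Fast) &&
        Fast)
      return WideSize;
    return std::min(WideSize, std::max(1, Alignment));
  }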


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:207
+
+    // Otherwise build an actual fast path.
+    ++NumFastPaths;
----------------
The function is pretty large. Perhaps split the analysis and transformation?
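Something like the following interface would capture the split (names and fields are illustrative only):

  #include "llvm/ADT/Optional.h"
  #include "llvm/IR/IntrinsicInst.h"

  using namespace llvm;

  // Analysis result computed without touching the IR.
  struct FastPathPlan {
    int OpByteSize;    // Width of each access on the fast path.
    int MaxIterations; // TTI-derived bound on the fast-path loop.
  };

  // The analysis decides whether a fast path is worthwhile...
  Optional<FastPathPlan> analyzeMemOp(MemIntrinsic &I);
  // ...and the transformation consumes the plan.
  bool emitFastPath(MemIntrinsic &I, const FastPathPlan &Plan);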


================
Comment at: lib/Transforms/Scalar/FastPathLibCalls.cpp:314
+
+    // Now build the inner part of the fastpath for this routine.
+    IRBuilder<> IRB(FastPath->ThenBB->getFirstNonPHI());
----------------
Can you make this a callback invoked by buildFastPathMemOpFramework?
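That would look roughly like this (a sketch; the framework name comes from the patch, everything else is illustrative):

  #include "llvm/ADT/STLExtras.h"
  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/IntrinsicInst.h"

  using namespace llvm;

  struct FastPathBlocks; // stand-in for the patch's FastPath structure

  // The framework performs the control-flow surgery, then hands a builder
  // positioned in the "then" block to a per-routine callback that emits the
  // fast-path body (memset vs. memcpy vs. memmove).
  void buildFastPathMemOpFramework(
      MemIntrinsic &I,
      function_ref<void(IRBuilder<> &IRB, FastPathBlocks &FP)> EmitBody);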


https://reviews.llvm.org/D36059




