[PATCH] D55263: [CodeGen][ExpandMemcmp] Add an option for allowing overlapping loads.

Mon Dec 10 07:39:18 PST 2018

spatel added a comment.

In D55263#1325283 <https://reviews.llvm.org/D55263#1325283>, @davezarzycki wrote:

> In D55263#1320043 <https://reviews.llvm.org/D55263#1320043>, @spatel wrote:
>
> > I just looked over the codegen changes so far, but I want to add some more knowledgeable x86 hackers to have a look too. There are 2 concerns:
> >
> > 1. Are there any known uarch problems with overlapping loads?
> > 2. Are there any known uarch problems with unaligned accesses (either scalar or SSE)?
> >
> >   If there's any perf data (either nano-benchmarks or full apps) to support the changes, that would be nice to see. This reminds me of PR33329: https://bugs.llvm.org/show_bug.cgi?id=33329 (can we close that now?)
>
>
> Let's not lose sight of the big picture here. If uarch problems exist, are they *worse* than the cost of calling memcmp()? In other words, is the likely register spills, function call overhead, and dynamic algorithm selection (given the constantness of the size parameter is lost) worth it?

No real argument from me - I just wanted to be sure that nobody knew of some pre-existing uarch disaster potential. The change should be a nice win for the general x86 case. I've just added some nits in the inine comments.

================
Comment at: include/llvm/Analysis/TargetTransformInfo.h:584
     SmallVector<unsigned, 8> LoadSizes;
+    // Whether to allow overlapping loads, e.g. 7-byte compares can be done with
+    // two 4-byte compares instead of 4+2+1-byte compares. This requires all
----------------
Rephrase: "Set to true to allow overlapping loads. For example, ..."

================
Comment at: lib/CodeGen/ExpandMemCmp.cpp:76
+    unsigned LoadSize;
     // The offset of this load WRT the base pointer, in bytes.
+    uint64_t Offset;
----------------
Could be independent of this patch, but it would be less cryptic to change "WRT" to "from" since we're making changes here.

================
Comment at: lib/CodeGen/ExpandMemCmp.cpp:157
+  Size = Size - NumNonOverlappingLoads * MaxLoadSize;
+  if (NumNonOverlappingLoads + !!(Size > 0) > MaxNumLoads)
+    return {};
----------------
This could use an explanatory comment and/or example with small numbers.
We can assert that MaxLoadSize > 1 on entry to this function? 
If Size == 0 at this point, shouldn't we return immediately because that's not an overlapping sequence? Or can that be asserted? That would mean that the optimal greedy (non-overlapping) sequence was not already found.

================
Comment at: lib/CodeGen/ExpandMemCmp.cpp:249

+/// Returns a pointer to an element of type `LoadSizeType` at offset
+/// `OffsetBytes`.
----------------
Returns -> Return

================
Comment at: lib/Target/X86/X86TargetTransformInfo.cpp:2892-2893
     TTI::MemCmpExpansionOptions Options;
     // TODO: enable AVX512 when the DAG is ready.
     // if (ST->hasAVX512()) Options.LoadSizes.push_back(64);
     if (ST->hasAVX2()) Options.LoadSizes.push_back(32);
----------------
Independent of this patch, but I think this can be enabled. See:
rL342989

Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D55263/new/

https://reviews.llvm.org/D55263