[PATCH] D64805: [AArch64] Expand bcmp() for small comparisons

Sjoerd Meijer via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Aug 5 08:46:55 PDT 2019


SjoerdMeijer added inline comments.


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:625
+  Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
+  Options.NumLoadsPerBlock = 2;
+  Options.AllowOverlappingLoads = true;
----------------
evandro wrote:
> courbet wrote:
> > brzycki wrote:
> > > SjoerdMeijer wrote:
> > > > why 2?
> > > @evandro selected these based on the x86_64 values set in `llvm/lib/Target/X86/X86TargetTransformInfo.cpp`:
> > > ```
> > >   Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
> > >   Options.NumLoadsPerBlock = 2;
> > >   // ...
> > >   if (ST->is64Bit()) {
> > >     Options.LoadSizes.push_back(8);
> > >   }
> > >   Options.LoadSizes.push_back(4);
> > >   Options.LoadSizes.push_back(2);
> > >   Options.LoadSizes.push_back(1);
> > > ```
> > > 
> > > I do not know why x86_64 selected these values.
> > This trades loads for jumps: with this setting, each block will be twice as big (more work per block), but there will be half as many blocks.
> > 
> > This was found to be more efficient in general on x86, and might not be true for ARM. This will have to be benchmarked.
> > 
> > 
> As @courbet said, the best figure may be different, but this seemed like a sensible starting point.  Since A64 has more registers, methinks that a higher value might be beneficial, but it depends on how many live registers there are.
These options look a lot more sensible now as TLI is used, but has this actually been benchmarked?
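For reference, a rough sketch of how the whole hook could look with the options discussed here. The function name and signature are assumed from how other targets implement `enableMemCmpExpansion`, and the values are just the untuned starting point from this discussion, not benchmarked figures:

```
AArch64TTIImpl::TTI::MemCmpExpansionOptions
AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
  TTI::MemCmpExpansionOptions Options;
  // Cap the total number of loads with the generic memcmp expansion limit.
  Options.MaxNumLoads = TLI->getMaxExpandSizeMemcmp(OptSize);
  // Two loads per comparison block: each block does more work, but there are
  // half as many branches. Copied from x86; still to be benchmarked on AArch64.
  Options.NumLoadsPerBlock = 2;
  // Overlapping loads let e.g. a 7-byte compare use two overlapping 4-byte
  // loads instead of a 4 + 2 + 1 sequence.
  Options.AllowOverlappingLoads = true;
  // Scalar GPR load sizes, largest first.
  Options.LoadSizes.push_back(8);
  Options.LoadSizes.push_back(4);
  Options.LoadSizes.push_back(2);
  Options.LoadSizes.push_back(1);
  return Options;
}
```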


================
Comment at: llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:626
+  Options.NumLoadsPerBlock = 2;
+  Options.AllowOverlappingLoads = true;
+  Options.LoadSizes.push_back(8);
----------------
evandro wrote:
> courbet wrote:
> > brzycki wrote:
> > > SjoerdMeijer wrote:
> > > > How about vector loads and alignment? Can we generate them for this, and do we then also need to add 16-byte loads to LoadSizes?
> > > Interesting question, I'll defer to @evandro when he can comment on the ticket. There is a bit more logic in x86_64 he stripped out when building this patch:
> > > ```
> > >   if (IsZeroCmp) {
> > >     // Only enable vector loads for equality comparison. Right now the vector
> > >     // version is not as fast for three way compare (see #33329).
> > >     // TODO: enable AVX512 when the DAG is ready.
> > >     // if (ST->hasAVX512()) Options.LoadSizes.push_back(64);
> > >     const unsigned PreferredWidth = ST->getPreferVectorWidth();
> > >     if (PreferredWidth >= 256 && ST->hasAVX2()) Options.LoadSizes.push_back(32);
> > >     if (PreferredWidth >= 128 && ST->hasSSE2()) Options.LoadSizes.push_back(16);
> > >     // All GPR and vector loads can be unaligned. SIMD compare requires integer
> > >     // vectors (SSE2/AVX2).
> > >     Options.AllowOverlappingLoads = true;
> > >   }
> > > ```
> > For x86, vector loads are enabled only for equality comparison because there is no efficient vector-sized bswap.
> > 
> > I don't know much about arm, but:
> > 
> >  * if vector loads and equality comparisons are efficient, there is no reason not to use them when IsZeroCmp is set
> >  * if there is an efficient vector-sized bswap, vector sizes can be used in both cases
> > 
> > 
> Vector loads are efficient performance-wise on A64, but typically not so much power-wise. That's why I'm not adding them just yet. If they are added later, vector byte swaps may be used as well.
Adding this comment as a `// FIXME` or `// TODO` would be good, I guess.
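Something along these lines, perhaps. This is only a sketch of the idea, mirroring the x86 logic quoted above; gating on `ST->hasNEON()` and picking a 16-byte load size are my assumptions for a possible follow-up, not part of the current diff:

```
// FIXME: 16-byte vector loads could speed up equality comparisons, but are
// not enabled yet because of their power cost; three-way compares would
// additionally need an efficient vector-sized byte swap.
if (IsZeroCmp && ST->hasNEON())
  Options.LoadSizes.insert(Options.LoadSizes.begin(), 16);
```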


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D64805/new/

https://reviews.llvm.org/D64805




