[llvm-dev] Lowering llvm.memset for ARM target

Thu Sep 7 08:24:46 PDT 2017

Hi Bharathi,

MaxStoresPerMemset was changed from 16 to 8 in r169791. The commit comment:

"Some enhancements for memcpy / memset inline expansion.
1. Teach it to use overlapping unaligned load / store to copy / set the trailing
   bytes. e.g. On 86, use two pairs of movups / movaps for 17 - 31 byte copies.
2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is. e.g.
   x86 and ARM.
3. When memcpy from a constant string, do *not* replace the load with a constant
   if it's not possible to materialize an integer immediate with a single
   instruction (required a new target hook: TLI.isIntImmLegal()).
4. Use unaligned load / stores more aggressively if target hooks indicates they
   are "fast".
5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
   Also increase the threshold to something reasonable (8 for memset, 4 pairs
   for memcpy).

This significantly improves Dhrystone, up to 50% on ARM iOS devices.

rdar://12760078"

That is strange: according to the commit comment the threshold was increased, but the change actually decreased it from 16 to 8. I think the code needs to be revisited and benchmarked.
I'll do some benchmarking.

Thanks,
Evgeny Astigeevich | Arm Compiler Optimization Team Lead

> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Bharathi Seshadri via llvm-dev
> Sent: Tuesday, September 05, 2017 8:24 PM
> To: llvm-dev at lists.llvm.org
> Subject: [llvm-dev] Lowering llvm.memset for ARM target
> 
> As reported in an earlier thread
> (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
> we noticed in some cases that the llvm.memset intrinsic, if lowered to stores,
> could help with performance.
> 
> Here's a test case. If LIMIT is greater than 8, a call to memset is emitted for the
> arm and aarch64 targets, but not for the x86 target.
> 
> typedef struct {
>     int v0[100];
> } test;
> #define LIMIT 9
> void init(test *t)
> {
>     int i;
>     for (i = 0; i < LIMIT ; i++)
>       t->v0[i] = 0;
> }
> int main() {
>     test t;
>     init(&t);
>     return 0;
> }
> 
> Looking at the llvm sources, I see that there are two key target specific
> variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize, that
> determine if the intrinsic llvm.memset can be lowered into store operations.
> For ARM, these variables are set to 8 and 4 respectively.
> 
> I do not know how the default values for these two variables were
> chosen, but doubling them (similar to the values for the x86
> target) helps our case: we observe a 7% performance improvement
> in our networking application. We build with -O3 and -flto for 32-bit ARM.
> 
> I can prepare a patch and post it for review if such a change, say under
> CodeGenOpt::Aggressive, would be acceptable.
> 
> Thanks,
> Bharathi