[llvm-dev] Lowering llvm.memset for ARM target

Evgeny Astigeevich via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 8 08:22:08 PDT 2017


Hi Bharathi,

From the discussion you provided I found that the issue happens for a big-endian ARM target.
For the little-endian target the intrinsic in your test case is lowered to store instructions.
Some debugging is needed to figure out why it's not happening for big-endian.
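
As a rough illustration of why the threshold matters, here is a minimal sketch in C of the store-count check discussed in the quoted messages. It is a simplified model, not the LLVM code: the helper names stores_needed and expands_inline are made up for this example, and it assumes every store has the same fixed width, whereas the real lowering picks store types based on alignment and target features.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: number of stores needed to fill `bytes` bytes when
   each store writes `store_width` bytes (rounded up for a partial tail). */
static size_t stores_needed(size_t bytes, size_t store_width)
{
    return (bytes + store_width - 1) / store_width;
}

/* Hypothetical model of the threshold check: the memset is expanded into
   stores only if the store count does not exceed `max_stores`, which plays
   the role of MaxStoresPerMemset in this sketch. */
static bool expands_inline(size_t bytes, size_t store_width, size_t max_stores)
{
    return stores_needed(bytes, store_width) <= max_stores;
}
```

Under the assumption of 4-byte stores, LIMIT = 8 gives 32 bytes and 8 stores, which fits a limit of 8, while LIMIT = 9 gives 36 bytes and 9 stores, which does not fit a limit of 8 but would fit a limit of 16. That is consistent with the behaviour reported for LIMIT > 8, although the actual lowering may behave differently when wider stores are available.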

-Evgeny

> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> Evgeny Astigeevich via llvm-dev
> Sent: Thursday, September 07, 2017 4:25 PM
> To: Bharathi Seshadri
> Cc: llvm-dev; nd
> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> 
> Hi Bharathi,
> 
> MaxStoresPerMemset was changed from 16 to 8 in r169791. The commit
> comment:
> 
> "Some enhancements for memcpy / memset inline expansion.
> 1. Teach it to use overlapping unaligned load / store to copy / set the trailing
>    bytes. e.g. On x86, use two pairs of movups / movaps for 17 - 31 byte copies.
> 2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is.
>    e.g. x86 and ARM.
> 3. When memcpy from a constant string, do *not* replace the load with a constant
>    if it's not possible to materialize an integer immediate with a single
>    instruction (required a new target hook: TLI.isIntImmLegal()).
> 4. Use unaligned load / stores more aggressively if target hooks indicates they
>    are "fast".
> 5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
>    Also increase the threshold to something reasonable (8 for memset, 4 pairs
>    for memcpy).
> 
> This significantly improves Dhrystone, up to 50% on ARM iOS devices.
> 
> rdar://12760078"
> 
> It's strange: according to the comment the threshold was increased, but the
> code actually decreased it. I think the code needs to be revisited and
> benchmarked.
> I'll do some benchmarking.
> 
> Thanks,
> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
> 
> 
> > -----Original Message-----
> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> > Bharathi Seshadri via llvm-dev
> > Sent: Tuesday, September 05, 2017 8:24 PM
> > To: llvm-dev at lists.llvm.org
> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
> >
> > As reported in an earlier thread
> > (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
> > we noticed in some cases that the llvm.memset intrinsic, if lowered to
> > stores, could help with performance.
> >
> > Here's a test case: If LIMIT is > 8, I see that a call to memset is
> > emitted for arm & aarch64, but not for x86 target.
> >
> > typedef struct {
> >     int v0[100];
> > } test;
> > #define LIMIT 9
> > void init(test *t)
> > {
> >     int i;
> >     for (i = 0; i < LIMIT ; i++)
> >       t->v0[i] = 0;
> > }
> > int main() {
> >     test t;
> >     init(&t);
> >     return 0;
> > }
> >
> > Looking at the llvm sources, I see that there are two key target
> > specific variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize,
> > that determine if the intrinsic llvm.memset can be lowered into store
> > operations.
> > For ARM, these variables are set to 8 and 4 respectively.
> >
> > I do not know how the default values for these two variables were
> > arrived at, but doubling them (similar to the values for the x86
> > target) seems to help our case, and we observe a 7% increase in
> > performance of our networking application. We use -O3 and -flto on
> > 32-bit ARM.
> >
> > I can prepare a patch and post it for review if such a change, say under
> > CodeGenOpt::Aggressive, would be acceptable.
> >
> > Thanks,
> > Bharathi
> > _______________________________________________
> > LLVM Developers mailing list
> > llvm-dev at lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev

