[llvm-dev] Lowering llvm.memset for ARM target

Evgeny Astigeevich via llvm-dev llvm-dev at lists.llvm.org
Mon Sep 11 06:27:44 PDT 2017


Hi Bharathi,

'-mfpu=vfp' is the root cause of the problem. It selects VFPv2 with the Advanced SIMD extension disabled.
The intrinsic is lowered into stores only when Advanced SIMD extension is enabled. So, if your target supports the Advanced SIMD extension, the workaround is '-mfpu=neon'.
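
To confirm which case a given set of flags falls into, one option is to test the ACLE feature macro __ARM_NEON, which the compiler predefines when Advanced SIMD is available. A minimal sketch; the function name advanced_simd_enabled is only illustrative and is not part of any library:

```c
/* Returns 1 when this translation unit was compiled with the Advanced SIMD
   (NEON) extension enabled, 0 otherwise.
   The function name is hypothetical; __ARM_NEON is the ACLE feature-test
   macro and __ARM_NEON__ is the older spelling some compilers also define. */
int advanced_simd_enabled(void)
{
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

The same check can be done without any code by dumping the predefined macros (add '-dM -E' to the compile flags and run them on an empty input) and looking for __ARM_NEON.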
I'll check what is happening when the Advanced SIMD extension is disabled.
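
For reference, the MaxStoresPerMemset threshold discussed in the quoted messages can be modelled roughly as below. This is a simplified sketch, not LLVM's actual lowering code: the function names are made up, and it assumes stores of power-of-two widths chosen greedily, without the overlapping-store trick mentioned in the commit message. Which maximum store width the ARM backend really picks for a given -mfpu setting is not established in this thread, so the 16-byte case is only an assumption.

```c
#include <assert.h>

/* Simplified model (hypothetical helper, not LLVM code): count the stores
   needed to set `bytes` bytes when the widest available store is
   `max_width` bytes. `max_width` must be a non-zero power of two. */
static unsigned stores_needed(unsigned bytes, unsigned max_width)
{
    unsigned count = 0;
    unsigned width = max_width;

    assert(max_width != 0 && (max_width & (max_width - 1)) == 0);

    while (bytes > 0) {
        /* Shrink the store width until it fits the remaining bytes. */
        while (width > bytes)
            width /= 2;
        bytes -= width;
        count++;
    }
    return count;
}

/* Returns 1 if the memset would be expanded inline under this model,
   0 if it would stay a call to the library memset. */
static int lowered_inline(unsigned bytes, unsigned max_width, unsigned limit)
{
    return stores_needed(bytes, max_width) <= limit;
}
```

For the test case (LIMIT 9, so 36 bytes), stores_needed(36, 4) is 9: over the limit of 8, but under a limit of 16, which matches the nine str instructions in the patched output. Under the assumed 16-byte store width, stores_needed(36, 16) is 3 (16 + 16 + 4), well within the limit of 8.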

Thanks,
Evgeny

> -----Original Message-----
> From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com]
> Sent: Friday, September 08, 2017 9:39 PM
> To: Evgeny Astigeevich
> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> 
> Hi Evgeny,
> 
> Even for a little-endian ARM target, I don't see that the intrinsic is lowered
> into stores. I checked with llvm38, llvm40 and a somewhat recent trunk
> (about a month old). I'm not sure what I'm missing.
> 
> For my test case compiled using -O3 -c --target=arm-linux-gnueabi
> -march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get
> 
> bash-4.1$ cat trymem2.c
> typedef struct {
>     int v0[100];
> } test;
> #define LIMIT 9
> void init(test *t)
> {
>     int i;
>     for (i = 0; i < LIMIT ; i++)
>       t->v0[i] = 0;
> }
> int main() {
> test t;
> init(&t);
> return 0;
> }
> 
> 
> $objdump -d
> 00000000 <init>:
>    0:   e92d4800        push    {fp, lr}
>    4:   e1a0b00d        mov     fp, sp
>    8:   e3a01000        mov     r1, #0
>    c:   e3a02024        mov     r2, #36 ; 0x24
>   10:   ebfffffe        bl      0 <memset> <====== Call to memset
>   14:   e8bd8800        pop     {fp, pc}
> 00000018 <main>:
>   18:   e3a00000        mov     r0, #0
>   1c:   e12fff1e        bx      lr
> 
> With my clang patched to raise MaxStoresPerMemset for ARM to 16, I get
> 
> Disassembly of section .text:
> 00000000 <init>:
>    0:   e3a01000        mov     r1, #0
>    4:   e5801020        str     r1, [r0, #32]
>    8:   e5801004        str     r1, [r0, #4]
>    c:   e5801008        str     r1, [r0, #8]
>   10:   e580100c        str     r1, [r0, #12]
>   14:   e5801010        str     r1, [r0, #16]
>   18:   e5801014        str     r1, [r0, #20]
>   1c:   e5801018        str     r1, [r0, #24]
>   20:   e580101c        str     r1, [r0, #28]
>   24:   e5801000        str     r1, [r0]
>   28:   e12fff1e        bx      lr
> 0000002c <main>:
>   2c:   e3a00000        mov     r0, #0
>   30:   e12fff1e        bx      lr
> 
> 
> Thanks,
> 
> Bharathi
> 
> On Fri, Sep 8, 2017 at 8:22 AM, Evgeny Astigeevich
> <Evgeny.Astigeevich at arm.com> wrote:
> > Hi Bharathi,
> >
> > From the discussion you provided I found that the issue happens for a
> > big-endian ARM target.
> > For the little-endian target the intrinsic in your test case is lowered to
> > store instructions.
> > Some debugging is needed to figure out why it's not happening for
> > big-endian.
> >
> > -Evgeny
> >
> >> -----Original Message-----
> >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
> >> Evgeny Astigeevich via llvm-dev
> >> Sent: Thursday, September 07, 2017 4:25 PM
> >> To: Bharathi Seshadri
> >> Cc: llvm-dev; nd
> >> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
> >>
> >> Hi Bharathi,
> >>
> >> MaxStoresPerMemset was changed from 16 to 8 in r169791. The commit
> >> comment:
> >>
> >> "Some enhancements for memcpy / memset inline expansion.
> >> 1. Teach it to use overlapping unaligned load / store to copy / set the
> >>    trailing bytes. e.g. On 86, use two pairs of movups / movaps for
> >>    17 - 31 byte copies.
> >> 2. Use f64 for memcpy / memset on targets where i64 is not legal but f64
> >>    is. e.g. x86 and ARM.
> >> 3. When memcpy from a constant string, do *not* replace the load with a
> >>    constant if it's not possible to materialize an integer immediate with
> >>    a single instruction (required a new target hook:
> >>    TLI.isIntImmLegal()).
> >> 4. Use unaligned load / stores more aggressively if target hooks
> >>    indicates they are "fast".
> >> 5. Update ARM target hooks to use unaligned load / stores. e.g.
> >>    vld1.8 / vst1.8. Also increase the threshold to something reasonable
> >>    (8 for memset, 4 pairs for memcpy).
> >>
> >> This significantly improves Dhrystone, up to 50% on ARM iOS devices.
> >>
> >> rdar://12760078"
> >>
> >> It's strange. According to the comment the threshold was increased,
> >> but it was actually decreased. I think the code needs to be revisited and
> >> benchmarked.
> >> I'll do some benchmarking.
> >>
> >> Thanks,
> >> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
> >>
> >>
> >> > -----Original Message-----
> >> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
> >> > Of Bharathi Seshadri via llvm-dev
> >> > Sent: Tuesday, September 05, 2017 8:24 PM
> >> > To: llvm-dev at lists.llvm.org
> >> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
> >> >
> >> > As reported in an earlier thread
> >> > (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
> >> > we noticed in some cases that the llvm.memset intrinsic, if lowered
> >> > to stores, could help with performance.
> >> >
> >> > Here's a test case: If LIMIT is > 8, I see that a call to memset is
> >> > emitted for arm & aarch64, but not for x86 target.
> >> >
> >> > typedef struct {
> >> >     int v0[100];
> >> > } test;
> >> > #define LIMIT 9
> >> > void init(test *t)
> >> > {
> >> >     int i;
> >> >     for (i = 0; i < LIMIT ; i++)
> >> >       t->v0[i] = 0;
> >> > }
> >> > int main() {
> >> > test t;
> >> > init(&t);
> >> > return 0;
> >> > }
> >> >
> >> > Looking at the llvm sources, I see that there are two key target
> >> > specific variables, MaxStoresPerMemset and MaxStoresPerMemsetOptSize,
> >> > that determine if the intrinsic llvm.memset can be lowered into store
> >> > operations.
> >> > For ARM, these variables are set to 8 and 4 respectively.
> >> >
> >> > I do not know how the default values for these two variables were
> >> > arrived at, but doubling these values (similar to that for the x86
> >> > target) seems to help our case and we observe a 7% increase in
> >> > performance of our networking application. We use -O3 and -flto and
> >> > 32-bit arm.
> >> >
> >> > I can prepare a patch and post it for review if such a change, say
> >> > under CodeGenOpt::Aggressive, would be acceptable.
> >> >
> >> > Thanks,
> >> > Bharathi
> >> > _______________________________________________
> >> > LLVM Developers mailing list
> >> > llvm-dev at lists.llvm.org
> >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev