[llvm-dev] Lowering llvm.memset for ARM target

Evgeny Astigeevich via llvm-dev llvm-dev at lists.llvm.org
Wed Oct 25 08:22:53 PDT 2017


Hi Bharathi,

I did some debugging. The current problem is that the same threshold values are used for SIMD and non-SIMD memory instructions.
For the test you provided, when the SIMD extension is disabled, the current implementation of llvm.memset lowering finds that 9 store instructions will be required. As 9 > 8, llvm.memset is lowered to a call.
When SIMD is enabled, it finds that 3 stores (two vst1.32 + one str) are enough. As 3 < 8, llvm.memset is lowered to a sequence of stores.
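
In simplified form, the decision is roughly the following (a standalone sketch only; the helper below is illustrative and not the actual SelectionDAG code, which also takes alignment, overlapping stores and other details into account):

    // Sketch of how the store count for a 36-byte memset is derived and
    // compared against the MaxStoresPerMemset threshold (currently 8 on ARM).
    #include <cstdio>

    // Greedily cover Bytes with stores no wider than LargestStore bytes.
    static unsigned storesNeeded(unsigned Bytes, unsigned LargestStore) {
      unsigned NumStores = 0;
      while (Bytes) {
        unsigned Chunk = LargestStore;
        while (Chunk > Bytes)   // use narrower stores for the tail
          Chunk /= 2;
        Bytes -= Chunk;
        ++NumStores;
      }
      return NumStores;
    }

    int main() {
      const unsigned MaxStoresPerMemset = 8;      // current ARM threshold
      unsigned WithNEON = storesNeeded(36, 16);   // two vst1.32 + one str -> 3
      unsigned WithoutNEON = storesNeeded(36, 4); // nine str              -> 9
      printf("NEON:    %u stores -> %s\n", WithNEON,
             WithNEON <= MaxStoresPerMemset ? "inlined" : "call memset");
      printf("no NEON: %u stores -> %s\n", WithoutNEON,
             WithoutNEON <= MaxStoresPerMemset ? "inlined" : "call memset");
      return 0;
    }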

So before changing the threshold values, we need to figure out:

1. Do we need separate thresholds for SIMD and non-SIMD memory instructions? For example, 8 for SIMD and 10-16 for non-SIMD. Some benchmarking is needed to find proper values; a sketch of what this could look like follows the list.
2. If we keep a single value, how much should it be increased? This might affect the performance of applications that use SIMD, so again benchmarking is needed.
3. If STRD is used, only 5 stores are needed (four STRD plus one STR for the 36 bytes) and llvm.memset is lowered as expected. I don't know why the STRD variant is not considered; maybe to avoid register pressure.
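
To make point 1 concrete, the change could have roughly the following shape (a sketch only: the placement would be wherever the ARM backend currently sets these limits, and the non-SIMD numbers are placeholders pending benchmarking):

    if (Subtarget->hasNEON()) {
      // Wide vst1 stores cover 16 bytes each, so few stores are enough.
      MaxStoresPerMemset = 8;          // current value
      MaxStoresPerMemsetOptSize = 4;   // current value
    } else {
      // Only 4-byte str / 8-byte strd stores are available, so allow more
      // of them before falling back to a libcall. Placeholder values.
      MaxStoresPerMemset = 12;
      MaxStoresPerMemsetOptSize = 6;
    }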

Hope this helps.

Thanks,
Evgeny Astigeevich 

-----Original Message-----
From: Evgeny Astigeevich <Evgeny.Astigeevich at arm.com>
Date: Monday, 11 September 2017 at 14:27
To: Bharathi Seshadri <bharathi.seshadri at gmail.com>
Cc: llvm-dev <llvm-dev at lists.llvm.org>, nd <nd at arm.com>
Subject: RE: [llvm-dev] Lowering llvm.memset for ARM target

    Hi Bharathi,
    
    '-mfpu=vfp' is the root cause of the problem. It means VFPv2 with the Advanced SIMD extension disabled.
    The intrinsic is lowered into stores only when the Advanced SIMD extension is enabled. So, if your target supports the Advanced SIMD extension, the workaround is '-mfpu=neon'.
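    With the flags from your test (assuming the same trymem2.c and a clang driver invocation), that would be:

        clang -O3 -c --target=arm-linux-gnueabi -march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=neon trymem2.c
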
    I'll check what is happening when the Advanced SIMD extension is disabled.
    
    Thanks,
    Evgeny
    
    > -----Original Message-----
    > From: Bharathi Seshadri [mailto:bharathi.seshadri at gmail.com]
    > Sent: Friday, September 08, 2017 9:39 PM
    > To: Evgeny Astigeevich
    > Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
    > 
    > Hi Evgeny,
    > 
    > Even for a little-endian ARM target, I don't see that the intrinsic is lowered
    > into stores. I checked with llvm38, llvm40 and a somewhat recent trunk
    > (about a month old). I'm not sure what I'm missing.
    > 
    > For my test case compiled using -O3 -c --target=arm-linux-gnueabi
    > -march=armv8-a+crc -mfloat-abi=hard -no-integrated-as -mfpu=vfp, I get
    > 
    > bash-4.1$ cat trymem2.c
    > typedef struct {
    >     int v0[100];
    > } test;
    > #define LIMIT 9
    > void init(test *t)
    > {
    >     int i;
    >     for (i = 0; i < LIMIT ; i++)
    >       t->v0[i] = 0;
    > }
    > int main() {
    >     test t;
    >     init(&t);
    >     return 0;
    > }
    > 
    > 
    > $objdump -d
    > 00000000 <init>:
    >    0:   e92d4800        push    {fp, lr}
    >    4:   e1a0b00d        mov     fp, sp
    >    8:   e3a01000        mov     r1, #0
    >    c:   e3a02024        mov     r2, #36 ; 0x24
    >   10:   ebfffffe        bl      0 <memset> <====== Call to memset
    >   14:   e8bd8800        pop     {fp, pc}
    > 00000018 <main>:
    >   18:   e3a00000        mov     r0, #0
    >   1c:   e12fff1e        bx      lr
    > 
    > With my patched clang, which changes MaxStoresPerMemset for ARM to 16, I get:
    >
    > Disassembly of section .text:
    > 00000000 <init>:
    >    0:   e3a01000        mov     r1, #0
    >    4:   e5801020        str     r1, [r0, #32]
    >    8:   e5801004        str     r1, [r0, #4]
    >    c:   e5801008        str     r1, [r0, #8]
    >   10:   e580100c        str     r1, [r0, #12]
    >   14:   e5801010        str     r1, [r0, #16]
    >   18:   e5801014        str     r1, [r0, #20]
    >   1c:   e5801018        str     r1, [r0, #24]
    >   20:   e580101c        str     r1, [r0, #28]
    >   24:   e5801000        str     r1, [r0]
    >   28:   e12fff1e        bx      lr
    > 0000002c <main>:
    >   2c:   e3a00000        mov     r0, #0
    >   30:   e12fff1e        bx      lr
    > 
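    > (For reference, the change itself is essentially a one-line bump where the
    > ARM backend sets this limit, e.g. in ARMISelLowering.cpp:
    >
    >     MaxStoresPerMemset = 16;   // was 8
    >
    > and MaxStoresPerMemsetOptSize can be doubled the same way.)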
    > 
    > Thanks,
    > 
    > Bharathi
    > 
    > On Fri, Sep 8, 2017 at 8:22 AM, Evgeny Astigeevich
    > <Evgeny.Astigeevich at arm.com> wrote:
    > > Hi Bharathi,
    > >
    > > From the discussion you provided I found that the issue happens for a
    > > big-endian ARM target.
    > > For the little-endian target the intrinsic in your test case is lowered
    > > to store instructions.
    > > Some debugging is needed to figure out why it's not happening for
    > > big-endian.
    > >
    > > -Evgeny
    > >
    > >> -----Original Message-----
    > >> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
    > >> Evgeny Astigeevich via llvm-dev
    > >> Sent: Thursday, September 07, 2017 4:25 PM
    > >> To: Bharathi Seshadri
    > >> Cc: llvm-dev; nd
    > >> Subject: Re: [llvm-dev] Lowering llvm.memset for ARM target
    > >>
    > >> Hi Bharathi,
    > >>
    > >> MaxStoresPerMemset was changed from 16 to 8 in r169791. The commit
    > >> comment:
    > >>
    > >> "Some enhancements for memcpy / memset inline expansion.
    > >> 1. Teach it to use overlapping unaligned load / store to copy / set the trailing
    > >>    bytes. e.g. On x86, use two pairs of movups / movaps for 17 - 31 byte copies.
    > >> 2. Use f64 for memcpy / memset on targets where i64 is not legal but f64 is.
    > >>    e.g. x86 and ARM.
    > >> 3. When memcpy from a constant string, do *not* replace the load with a constant
    > >>    if it's not possible to materialize an integer immediate with a single
    > >>    instruction (required a new target hook: TLI.isIntImmLegal()).
    > >> 4. Use unaligned load / stores more aggressively if target hooks indicates they
    > >>    are "fast".
    > >> 5. Update ARM target hooks to use unaligned load / stores. e.g. vld1.8 / vst1.8.
    > >>    Also increase the threshold to something reasonable (8 for memset, 4 pairs
    > >>    for memcpy).
    > >>
    > >> This significantly improves Dhrystone, up to 50% on ARM iOS devices.
    > >>
    > >> rdar://12760078"
    > >>
    > >> It's strange: according to the commit message the threshold was increased,
    > >> but it was actually decreased (from 16 to 8). I think the code needs to be
    > >> revisited and benchmarked. I'll do some benchmarking.
    > >>
    > >> Thanks,
    > >> Evgeny Astigeevich | Arm Compiler Optimization Team Lead
    > >>
    > >>
    > >> > -----Original Message-----
    > >> > From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf
    > >> > Of Bharathi Seshadri via llvm-dev
    > >> > Sent: Tuesday, September 05, 2017 8:24 PM
    > >> > To: llvm-dev at lists.llvm.org
    > >> > Subject: [llvm-dev] Lowering llvm.memset for ARM target
    > >> >
    > >> > As reported in an earlier thread
    > >> > (http://clang-developers.42468.n3.nabble.com/Disable-memset-synthesis-tp4057810.html),
    > >> > we noticed in some cases that the llvm.memset intrinsic, if lowered
    > >> > to stores, could help with performance.
    > >> >
    > >> > Here's a test case: If LIMIT is > 8, I see that a call to memset is
    > >> > emitted for arm & aarch64, but not for the x86 target.
    > >> >
    > >> > typedef struct {
    > >> >     int v0[100];
    > >> > } test;
    > >> > #define LIMIT 9
    > >> > void init(test *t)
    > >> > {
    > >> >     int i;
    > >> >     for (i = 0; i < LIMIT ; i++)
    > >> >       t->v0[i] = 0;
    > >> > }
    > >> > int main() {
    > >> >     test t;
    > >> >     init(&t);
    > >> >     return 0;
    > >> > }
    > >> >
    > >> > Looking at the llvm sources, I see that there are two key
    > >> > target-specific variables, MaxStoresPerMemset and
    > >> > MaxStoresPerMemsetOptSize, that determine if the intrinsic llvm.memset
    > >> > can be lowered into store operations.
    > >> > For ARM, these variables are set to 8 and 4 respectively.
    > >> >
    > >> > I do not know how the default values for these two variables were
    > >> > arrived at, but doubling them (similar to the x86 target) seems to
    > >> > help our case: we observe a 7% increase in performance of our
    > >> > networking application. We use -O3 and -flto and 32-bit arm.
    > >> >
    > >> > I can prepare a patch and post it for review if such a change, say
    > >> > gated under CodeGenOpt::Aggressive, would be acceptable; a rough
    > >> > sketch follows.
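    > >> >
    > >> > The gating below is only a sketch of what I have in mind (placed
    > >> > wherever the ARM backend sets these limits; the exact condition and
    > >> > values are open to discussion):
    > >> >
    > >> >     if (TM.getOptLevel() == CodeGenOpt::Aggressive) {
    > >> >       MaxStoresPerMemset = 16;        // doubled from 8
    > >> >       MaxStoresPerMemsetOptSize = 8;  // doubled from 4
    > >> >     } else {
    > >> >       MaxStoresPerMemset = 8;         // current defaults
    > >> >       MaxStoresPerMemsetOptSize = 4;
    > >> >     }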
    > >> >
    > >> > Thanks,
    > >> > Bharathi
    > >> > _______________________________________________
    > >> > LLVM Developers mailing list
    > >> > llvm-dev at lists.llvm.org
    > >> > http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
    > >> _______________________________________________
    > >> LLVM Developers mailing list
    > >> llvm-dev at lists.llvm.org
    > >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
    


