[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX

Wed Nov 4 07:46:48 PST 2015

Hi Jay -

I see the slow, small accesses using an older clang [Apple LLVM version
7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change
that comes into play if you don't specify a particular CPU:
http://llvm.org/viewvc/llvm-project?view=revision&revision=245950

$ ./clang -O1 -mavx copy.c -S -o -
...
    movslq    %edi, %rax
    movq    _spr_dynamic@GOTPCREL(%rip), %rcx
    movq    (%rcx), %rcx
    shlq    $5, %rax
    movslq    %esi, %rdx
    movq    _spr_static@GOTPCREL(%rip), %rsi
    movq    (%rsi), %rsi
    shlq    $5, %rdx
    vmovups    (%rsi,%rdx), %ymm0                  <--- 32-byte load
    vmovups    %ymm0, (%rcx,%rax)                 <--- 32-byte store
    popq    %rbp
    vzeroupper
    retq

On Wed, Nov 4, 2015 at 8:11 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:

> Thanks, Hal.
>
> That code is very readable. Basically, the following has to be true
> - not a memset or memzero [check]
> - no implicit floats [check]
> - size at least 16 [check, it's 32]
> - ! isUnalignedMem16Slow [check?]
> - int256, fp256, or sse2, or sse1 is around [check]
>
> That last condition is:
> - src & dst alignment is 0 or at least 16
>
> I think this is true, because I'm reading from a giant array of these
> things, so the memory should be aligned to the object size. Assuming
> that's wrong, I added an explicit alignment attribute.
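
Read as code, the checklist above might look roughly like the following C sketch. It is a paraphrase of the conditions as listed in this message, not the LLVM source: the struct, its field names, and `pick_copy_width` are hypothetical, memset handling is left out, and it assumes that the size and alignment thresholds are inclusive and that the alignment test is an alternative to the `isUnalignedMem16Slow` test rather than an extra requirement.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one memcpy and the target it is lowered for. */
struct copy_query {
    size_t   size;               /* bytes to copy */
    unsigned dst_align;          /* known alignment in bytes, 0 = unconstrained */
    unsigned src_align;
    bool     no_implicit_float;  /* function forbids implicit FP/vector use */
    bool     unaligned16_slow;   /* stand-in for isUnalignedMem16Slow */
    bool     has_256bit;         /* stand-in for the int256/fp256 checks */
    bool     has_sse;            /* stand-in for the sse2/sse1 checks */
};

static bool align_ok(unsigned a)
{
    return a == 0 || a >= 16;
}

/* Returns the widest vector access (in bytes) this reading of the checklist
   allows, or 0 when the copy falls back to scalar moves. */
static unsigned pick_copy_width(const struct copy_query *q)
{
    assert(q != NULL);
    if (q->no_implicit_float || q->size < 16)
        return 0;
    if (q->unaligned16_slow &&
        !(align_ok(q->dst_align) && align_ok(q->src_align)))
        return 0;
    if (q->size >= 32 && q->has_256bit)
        return 32;
    return q->has_sse ? 16 : 0;
}
```

For the `align 4` memcpy discussed in this thread (size 32, both alignments 4), the sketch returns 32 when `unaligned16_slow` is false and 0 when it is true, which is one way the same source could end up as either a single 32-byte move or a run of small ones.
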
>
> I think part of the problem is that the memcpy that gets generated
> isn't for the structure, but for the structures bitcast into character
> arrays:
>
>   %17 = bitcast %struct.sprite* %9 to i8*
>   %18 = bitcast %struct.sprite* %16 to i8*
>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)
>
> So even though the original struct pointers were aligned at 32, the
> byte arrays that are created lose that alignment information.
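
One way to restate the alignment at the point of the copy is sketched below. It relies on `__builtin_assume_aligned`, a GCC/Clang builtin, and on C11 `static_assert`; the helper name `copy_sprite_aligned` is made up for illustration. Whether the promise actually reaches the alignment argument of `llvm.memcpy` depends on the compiler version, so it is worth checking the emitted IR rather than treating this as a guaranteed fix.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    float dx, dy;
    float mx, my;
    float theta, a;
    short spr, pal;
    char layer;
    char r, g, b;
} sprite;

static_assert(sizeof(sprite) == 32, "sprite is expected to be 32 bytes");

/* Hypothetical helper: copy one sprite while promising the compiler that
   both pointers are 32-byte aligned. Passing a pointer that is not really
   32-byte aligned is undefined behaviour. */
static void copy_sprite_aligned(sprite *dst, const sprite *src)
{
    sprite *d = __builtin_assume_aligned(dst, 32);
    const sprite *s = __builtin_assume_aligned(src, 32);
    memcpy(d, s, sizeof *d);
}
```

The promise only holds if the storage is really aligned, for example when the arrays are declared as `_Alignas(32) sprite pool[1024];`.
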
>
> If this is correct, would you recommend this as just an error that
> will be fixed with a little test case?
>
> BTW, Here's a tiny C program that demonstrates the "problem":
>
> typedef struct {
>   float dx; float dy;
>   float mx; float my;
>   float theta; float a;
>   short spr; short pal;
>   char layer;
>   char r; char g; char b;
> } sprite;
>
> sprite *spr_static;   // or array of [1024], or add __attribute__((align_value(32)))
> sprite *spr_dynamic;  // or array of [1024], or add __attribute__((align_value(32)))
>
> void copy(int i, int j) {
>   spr_dynamic[i] = spr_static[j];
> }
>
> Thanks!
>
> Jay
>
> On Tue, Nov 3, 2015 at 1:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >
> >
> > ----- Original Message -----
> >> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
> >> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
> >> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> >> Sent: Tuesday, November 3, 2015 12:30:51 PM
> >> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
> >>
> >> If the memcpy version isn't getting optimized into larger memory
> >> operations, that definitely sounds like a bug worth filing.
> >>
> >> Lowering of memcpy is affected by the size of the copy, alignments of
> >> the source and dest, and CPU target. You may be able to narrow down
> >> the problem by changing those parameters.
> >>
> >
> > The relevant target-specific logic is in
> > X86TargetLowering::getOptimalMemOpType; looking at that might help in
> > understanding what's going on.
> >
> >  -Hal
> >
> >>
> >> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <
> >> jay.mccarthy at gmail.com > wrote:
> >>
> >>
> >> Thank you for your reply. FWIW, I wrote the .ll by hand after taking
> >> the C program, using clang to emit the LLVM IR, and seeing the memcpy.
> >> The memcpy version that clang generates gets compiled into assembly
> >> that uses a long sequence of movs and does not use the vector hardware
> >> at all. When I started debugging, I took that clang-produced .ll and
> >> started rewriting it in different ways, trying to get different results.
> >>
> >> Jay
> >>
> >>
> >>
> >> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <
> >> spatel at rotateright.com > wrote:
> >> > Hi Jay -
> >> >
> >> > I'm surprised by the codegen for your examples too, but LLVM has an
> >> > expectation that a front-end and IR optimizer will use llvm.memcpy
> >> > liberally:
> >> >
> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
> >> >
> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
> >> >
> >> > "Any ld-ld-st-st sequence over this should have been converted to
> >> > llvm.memcpy by the frontend."
> >> > "The optimizer should really avoid this case by converting large
> >> > object/array copies to llvm.memcpy"
> >> >
> >> >
> >> > So for example with clang:
> >> >
> >> > $ cat copy.c
> >> > struct bagobytes {
> >> > int i0;
> >> > int i1;
> >> > };
> >> >
> >> > void foo(struct bagobytes* a, struct bagobytes* b) {
> >> > *b = *a;
> >> > }
> >> >
> >> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
> >> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
> >> > ...
> >> > call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
> >> > ret void
> >> > }
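
In source terms, the IR above says that clang treats the struct assignment as a byte copy. A small sketch of that equivalence, reusing `struct bagobytes` from the example, follows. `foo_memcpy` is a name invented here, and the claim is only that the two functions have the same observable effect for distinct objects, not that they always produce identical IR.

```c
#include <assert.h>
#include <string.h>

struct bagobytes {
    int i0;
    int i1;
};

static_assert(sizeof(struct bagobytes) == 2 * sizeof(int),
              "no padding expected between the two ints");

/* Struct assignment, as in the example above. */
static void foo(struct bagobytes *a, struct bagobytes *b)
{
    *b = *a;
}

/* Hypothetical spelled-out form: the same copy written as a memcpy call.
   Unlike the assignment, memcpy requires that a and b do not overlap. */
static void foo_memcpy(struct bagobytes *a, struct bagobytes *b)
{
    memcpy(b, a, sizeof *b);
}
```
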
> >> >
> >> > It may still be worth filing a bug (or seeing if one is already
> >> > open) for one of your simple examples.
> >> >
> >> >
> >> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
> >> > < llvm-dev at lists.llvm.org > wrote:
> >> >>
> >> >> I am a first-time poster, so I apologize if this is an obvious
> >> >> question or out of scope for LLVM. I am an LLVM user. I don't
> >> >> really know anything about hacking on LLVM, but I do know a bit
> >> >> about compilation generally.
> >> >>
> >> >> I am on x86-64 and I am interested in structure reads, writes, and
> >> >> constants being optimized to use vector registers when the
> >> >> alignment and sizes are right. I have created a gist of a small
> >> >> example:
> >> >>
> >> >> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
> >> >>
> >> >> The assembly is produced with
> >> >>
> >> >> llc -O3 -march=x86-64 -mcpu=corei7-avx
> >> >>
> >> >> The key idea is that we have a structure like this:
> >> >>
> >> >> %athing = type { float, float, float, float, float, float, i16, i16, i8, i8, i8, i8 }
> >> >>
> >> >> That works out to be 32 bytes, so it can fit in YMM registers.
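
The 32-byte figure can be checked mechanically. The C mirror of `%athing` below is an assumption: it presumes a 4-byte `float`, a 2-byte `short`, and the usual x86-64 alignment rules, and it groups same-typed fields into arrays with placeholder names `f`, `h`, and `c`, which does not change the layout.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed C equivalent of { float x 6, i16 x 2, i8 x 4 }. */
typedef struct {
    float f[6];          /* 6 * 4 = 24 bytes, offsets 0..23  */
    short h[2];          /* 2 * 2 =  4 bytes, offsets 24..27 */
    signed char c[4];    /* 4 * 1 =  4 bytes, offsets 28..31 */
} athing;

static_assert(offsetof(athing, h) == 24, "i16 fields start right after the floats");
static_assert(offsetof(athing, c) == 28, "i8 fields start right after the i16 fields");
static_assert(sizeof(athing) == 32, "24 + 4 + 4 = 32 bytes, the width of one YMM register");
```
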
> >> >>
> >> >> If I have two pointers to arrays of these things:
> >> >>
> >> >> @one = external global %athing
> >> >> @two = external global %athing
> >> >>
> >> >> and then I do a copy from one to the other
> >> >>
> >> >> %a = load %athing* @two
> >> >> store %athing %a, %athing* @one
> >> >>
> >> >> Then the code that is generated uses the XMM registers for the
> >> >> floats, but does 12 loads and then 12 stores.
> >> >>
> >> >> In contrast, if I manually cast to a properly sized float vector I
> >> >> get the desired single load and single store:
> >> >>
> >> >> %two_vector = bitcast %athing* @two to <8 x float>*
> >> >> %b = load <8 x float>* %two_vector
> >> >> %one_vector = bitcast %athing* @one to <8 x float>*
> >> >> store <8 x float> %b, <8 x float>* %one_vector
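
A rough C counterpart of the bitcast version, for front-ends that emit C rather than IR, is sketched below. It uses the GCC/Clang `vector_size` extension and `memcpy` for the type pun; `athing` (an assumed C mirror of `%athing`), `v8f`, and `copy_as_vector` are names invented for this sketch. Whether it compiles to a single `vmovups` pair still depends on the target flags (for example `-mavx`) and the compiler version.

```c
#include <assert.h>
#include <string.h>

/* Assumed C mirror of %athing: 6 floats, 2 i16, 4 i8 = 32 bytes. */
typedef struct {
    float f[6];
    short h[2];
    signed char c[4];
} athing;

/* 8 floats = 32 bytes (GCC/Clang vector extension). */
typedef float v8f __attribute__((vector_size(32)));

static_assert(sizeof(athing) == sizeof(v8f), "struct must fill the vector exactly");

/* Hypothetical helper: move the struct through one 32-byte vector value. */
static void copy_as_vector(athing *dst, const athing *src)
{
    v8f tmp;
    memcpy(&tmp, src, sizeof tmp);   /* intended to become one 32-byte load  */
    memcpy(dst, &tmp, sizeof tmp);   /* intended to become one 32-byte store */
}
```

Going through `memcpy` keeps the pun free of aliasing problems, and the copy is bitwise, so integer fields that happen to look like NaN bit patterns are preserved.
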
> >> >>
> >> >> The rest of the file demonstrates that the code for modifying these
> >> >> vectors is pretty good, but has examples of bad ways to initialize
> >> >> the structure and a good way to initialize it. If I try to store a
> >> >> constant struct, I get 13 stores. If I try to assemble a vector by
> >> >> casting <2 x i16> to float then <4 x i8> to float and installing
> >> >> them into a single <8 x float>, I do get the desired single store,
> >> >> but I get very complicated constants that are loaded from memory. In
> >> >> contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8>
> >> >> as I go, then I get the desired initialization with no loads and
> >> >> just modifications of the single YMM register. (Even this last one,
> >> >> however, doesn't have the best assembly because the words and bytes
> >> >> are not inserted into the vector simultaneously, but instead
> >> >> individually.)
> >> >>
> >> >> I am kind of surprised that the obvious code didn't get optimized
> >> >> the way I expected and even the tedious version of the
> >> >> initialization isn't optimal either. I would like to know if a
> >> >> transformation of one to the other is feasible in LLVM (I know
> >> >> anything is possible, but what is feasible in this situation?) or if
> >> >> I should implement a transformation like this in my front-end and
> >> >> settle for the initialization that comes out.
> >> >>
> >> >> Thank you for your time,
> >> >>
> >> >> Jay
> >> >>
> >> >> --
> >> >> Jay McCarthy
> >> >> Associate Professor
> >> >> PLT @ CS @ UMass Lowell
> >> >> http://jeapostrophe.github.io
> >> >>
> >> >> "Wherefore, be not weary in well-doing,
> >> >> for ye are laying the foundation of a great work.
> >> >> And out of small things proceedeth that which is great."
> >> >> - D&C 64:33
> >> >> _______________________________________________
> >> >> LLVM Developers mailing list
> >> >> llvm-dev at lists.llvm.org
> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >
> >> >
> >>
> >>
> >
> > --
> > Hal Finkel
> > Assistant Computational Scientist
> > Leadership Computing Facility
> > Argonne National Laboratory
>