[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
Sanjay Patel via llvm-dev
llvm-dev at lists.llvm.org
Wed Nov 4 08:07:30 PST 2015
No problem. Please do file bugs if you see anything that looks suspicious.
The x86 memcpy lowering still has that FIXME comment that I haven't gotten
back around to, and we have at least one other potential improvement:
https://llvm.org/bugs/show_bug.cgi?id=24678
On Wed, Nov 4, 2015 at 8:53 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> Oh that's great. I'll just update and go from there. Thanks so much
> and sorry for the noise.
>
> Jay
>
> On Wed, Nov 4, 2015 at 10:46 AM, Sanjay Patel <spatel at rotateright.com>
> wrote:
> > Hi Jay -
> >
> > I see the slow, small accesses using an older clang [Apple LLVM version
> > 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change
> > that comes into play if you don't specify a particular CPU:
> > http://llvm.org/viewvc/llvm-project?view=revision&revision=245950
> >
> > $ ./clang -O1 -mavx copy.c -S -o -
> > ...
> > movslq %edi, %rax
> > movq _spr_dynamic at GOTPCREL(%rip), %rcx
> > movq (%rcx), %rcx
> > shlq $5, %rax
> > movslq %esi, %rdx
> > movq _spr_static at GOTPCREL(%rip), %rsi
> > movq (%rsi), %rsi
> > shlq $5, %rdx
> > vmovups (%rsi,%rdx), %ymm0 <--- 32-byte load
> > vmovups %ymm0, (%rcx,%rax) <--- 32-byte store
> > popq %rbp
> > vzeroupper
> > retq
> >
> >
> >
> > On Wed, Nov 4, 2015 at 8:11 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> >>
> >> Thanks, Hal.
> >>
> >> That code is very readable. Basically, the following all have to be true:
> >> - not a memset or memzero [check]
> >> - no implicit floats [check]
> >> - size greater than 16 [check, it's 32]
> >> - !isUnalignedMem16Slow() [check?]
> >> - int256, fp256, sse2, or sse1 is available [check]
> >>
> >> That last condition is:
> >> - src & dst alignment is 0 or greater than 16
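For concreteness, the checks listed above could be paraphrased as a standalone predicate like the sketch below. This is an editorial paraphrase, not LLVM's actual code; every name in it is invented, and the exact thresholds (e.g. whether the alignment bound is 16 or something else) should be confirmed against X86ISelLowering.cpp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Editorial paraphrase of the checks described above -- NOT LLVM's actual
 * code; all names and exact thresholds here are invented for illustration. */
static bool may_use_wide_vector_copy(bool is_memset_or_memzero,
                                     bool implicit_float_allowed,
                                     size_t size,
                                     bool unaligned_mem16_slow,
                                     unsigned src_align, unsigned dst_align,
                                     bool has_avx_or_sse) {
    if (is_memset_or_memzero)        /* only plain copies qualify */
        return false;
    if (!implicit_float_allowed)     /* no-implicit-float forbids vector regs */
        return false;
    if (size <= 16)                  /* copy must be bigger than 16 bytes */
        return false;
    if (unaligned_mem16_slow)        /* unaligned 16B access must be fast */
        return false;
    /* an alignment of 0 means "unknown/default" and is also accepted */
    if (!((src_align == 0 || src_align >= 16) &&
          (dst_align == 0 || dst_align >= 16)))
        return false;
    return has_avx_or_sse;           /* int256, fp256, sse2, or sse1 */
}
```

With these invented names, Jay's 32-byte, 32-aligned case would pass every check, which is why the `isUnalignedMem16Slow` and alignment conditions are the interesting ones to probe.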
> >>
> >> I think this is true, because I'm reading from a giant array of these
> >> things, so the memory should be aligned to the object size. In case
> >> that's wrong, I also added an explicit alignment attribute.
> >>
> >> I think part of the problem is that the memcpy that gets generated
> >> isn't for the structure itself, but for the struct pointers bitcast to
> >> character arrays:
> >>
> >> %17 = bitcast %struct.sprite* %9 to i8*
> >> %18 = bitcast %struct.sprite* %16 to i8*
> >> call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)
> >>
> >> So even though the original struct pointers were aligned at 32, the
> >> byte arrays that are created lose that alignment information.
> >>
> >> If this is correct, would you recommend I simply file this as a bug,
> >> along with a little test case?
> >>
> >> BTW, Here's a tiny C program that demonstrates the "problem":
> >>
> >> typedef struct {
> >> float dx; float dy;
> >> float mx; float my;
> >> float theta; float a;
> >> short spr; short pal;
> >> char layer;
> >> char r; char g; char b;
> >> } sprite;
> >>
> >> sprite *spr_static;  // or an array of [1024], or add
> >>                      // __attribute__((align_value(32)))
> >> sprite *spr_dynamic; // or an array of [1024], or add
> >>                      // __attribute__((align_value(32)))
> >>
> >> void copy(int i, int j) {
> >> spr_dynamic[i] = spr_static[j];
> >> }
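One way to make the 32-byte alignment explicit without relying on `align_value` on the pointers is to put the alignment on the struct type itself, so every array element is 32-byte aligned. A minimal sketch, assuming C11 and the same field layout as the program above:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Variant of the sprite example above: alignas(32) on the first member
 * forces the whole struct (and therefore every element of an array of
 * sprites) to 32-byte alignment, while keeping sizeof(sprite) == 32. */
typedef struct {
    alignas(32) float dx;
    float dy, mx, my, theta, a;
    short spr, pal;
    char layer;
    char r, g, b;
} sprite;

static sprite spr_static[1024];
static sprite spr_dynamic[1024];

void copy(int i, int j) {
    spr_dynamic[i] = spr_static[j];
}
```

With the alignment carried by the type, the frontend can emit the `llvm.memcpy` with align 32 instead of 4, which is exactly the information the i8* bitcasts were dropping.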
> >>
> >> Thanks!
> >>
> >> Jay
> >>
> >> On Tue, Nov 3, 2015 at 1:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> >> >
> >> >
> >> > ----- Original Message -----
> >> >> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
> >> >> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
> >> >> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> >> >> Sent: Tuesday, November 3, 2015 12:30:51 PM
> >> >> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on
> >> >> X86-64 AVX
> >> >>
> >> >> If the memcpy version isn't getting optimized into larger memory
> >> >> operations, that definitely sounds like a bug worth filing.
> >> >>
> >> >> Lowering of memcpy is affected by the size of the copy, alignments of
> >> >> the source and dest, and CPU target. You may be able to narrow down
> >> >> the problem by changing those parameters.
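The suggestion to vary those parameters can be mechanized: a throwaway file with one copy routine per size/alignment combination makes it easy to diff the emitted assembly (e.g. `clang -O2 -mavx -S`) and spot exactly where the lowering changes. A sketch; the specific sizes below are arbitrary choices for illustration:

```c
#include <assert.h>
#include <stdalign.h>
#include <string.h>

/* One trivial struct copy per size/alignment combination, so the emitted
 * assembly for each can be compared side by side. */
struct b8   { char b[8];  };
struct b16  { char b[16]; };
struct b32  { char b[32]; };
struct b32a { alignas(32) char b[32]; };   /* 32-byte-aligned variant */

void copy8(struct b8 *d, const struct b8 *s)       { *d = *s; }
void copy16(struct b16 *d, const struct b16 *s)    { *d = *s; }
void copy32(struct b32 *d, const struct b32 *s)    { *d = *s; }
void copy32a(struct b32a *d, const struct b32a *s) { *d = *s; }
```

Comparing `copy32` against `copy32a` isolates the alignment effect, and `copy16` against `copy32` isolates the size threshold.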
> >> >>
> >> >
> >> > The relevant target-specific logic is in
> >> > X86TargetLowering::getOptimalMemOpType, looking at that might help in
> >> > understanding what's going on.
> >> >
> >> > -Hal
> >> >
> >> >>
> >> >> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <
> >> >> jay.mccarthy at gmail.com > wrote:
> >> >>
> >> >>
> >> >> Thank you for your reply. FWIW, I wrote the .ll by hand: I took the C
> >> >> program, used clang to emit the LLVM IR, and saw the memcpy. The
> >> >> memcpy version that clang generates gets compiled into assembly that
> >> >> uses a large sequence of movs and does not use the vector hardware
> >> >> at all. When I started debugging, I took the clang-produced .ll and
> >> >> rewrote it in different ways, trying to get different results.
> >> >>
> >> >> Jay
> >> >>
> >> >>
> >> >>
> >> >> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <
> >> >> spatel at rotateright.com > wrote:
> >> >> > Hi Jay -
> >> >> >
> >> >> > I'm surprised by the codegen for your examples too, but LLVM has an
> >> >> > expectation that a front-end and IR optimizer will use llvm.memcpy
> >> >> > liberally:
> >> >> >
> >> >> >
> >> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
> >> >> >
> >> >> >
> >> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
> >> >> >
> >> >> > "Any ld-ld-st-st sequence over this should have been converted to
> >> >> > llvm.memcpy by the frontend."
> >> >> > "The optimizer should really avoid this case by converting large
> >> >> > object/array copies to llvm.memcpy"
> >> >> >
> >> >> >
> >> >> > So for example with clang:
> >> >> >
> >> >> > $ cat copy.c
> >> >> > struct bagobytes {
> >> >> > int i0;
> >> >> > int i1;
> >> >> > };
> >> >> >
> >> >> > void foo(struct bagobytes* a, struct bagobytes* b) {
> >> >> > *b = *a;
> >> >> > }
> >> >> >
> >> >> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
> >> >> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
> >> >> > ...
> >> >> > call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
> >> >> > ret void
> >> >> > }
> >> >> >
> >> >> > It may still be worth filing a bug (or seeing if one is already
> >> >> > open) for
> >> >> > one of your simple examples.
> >> >> >
> >> >> >
> >> >> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
> >> >> > < llvm-dev at lists.llvm.org > wrote:
> >> >> >>
> >> >> >> I am a first-time poster, so I apologize if this is an obvious
> >> >> >> question or out of scope for LLVM. I am an LLVM user; I don't
> >> >> >> really know anything about hacking on LLVM, but I do know a bit
> >> >> >> about compilation generally.
> >> >> >>
> >> >> >> I am on x86-64 and I am interested in structure reads, writes, and
> >> >> >> constants being optimized to use vector registers when the
> >> >> >> alignment
> >> >> >> and sizes are right. I have created a gist of a small example:
> >> >> >>
> >> >> >> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
> >> >> >>
> >> >> >> The assembly is produced with
> >> >> >>
> >> >> >> llc -O3 -march=x86-64 -mcpu=corei7-avx
> >> >> >>
> >> >> >> The key idea is that we have a structure like this:
> >> >> >>
> >> >> >> %athing = type { float, float, float, float, float, float,
> >> >> >>                  i16, i16, i8, i8, i8, i8 }
> >> >> >>
> >> >> >> That works out to be 32 bytes, so it can fit in YMM registers.
> >> >> >>
> >> >> >> If I have two pointers to arrays of these things:
> >> >> >>
> >> >> >> @one = external global %athing
> >> >> >> @two = external global %athing
> >> >> >>
> >> >> >> and then I do a copy from one to the other
> >> >> >>
> >> >> >> %a = load %athing* @two
> >> >> >> store %athing %a, %athing* @one
> >> >> >>
> >> >> >> Then the code that is generated uses the XMM registers for the
> >> >> >> floats,
> >> >> >> but does 12 loads and then 12 stores.
> >> >> >>
> >> >> >> In contrast, if I manually cast to a properly sized float vector I
> >> >> >> get
> >> >> >> the desired single load and single store:
> >> >> >>
> >> >> >> %two_vector = bitcast %athing* @two to <8 x float>*
> >> >> >> %b = load <8 x float>* %two_vector
> >> >> >> %one_vector = bitcast %athing* @one to <8 x float>*
> >> >> >> store <8 x float> %b, <8 x float>* %one_vector
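For reference, the same trick can be expressed at the C level with the GCC/clang `vector_size` extension. This is an editorial sketch of an equivalent, not code from the thread, and it assumes the struct is exactly 32 bytes:

```c
#include <assert.h>
#include <string.h>

/* C-level analogue of the IR bitcast above, using the GCC/clang
 * vector_size extension: route the 32-byte struct copy through an
 * 8 x float vector so the compiler can use one wide load and store.
 * memcpy is used instead of a raw pointer cast to avoid strict-aliasing
 * problems; at -O2 it compiles away. */
typedef float v8sf __attribute__((vector_size(32)));

typedef struct {
    float dx, dy, mx, my, theta, a;
    short spr, pal;
    char layer, r, g, b;
} athing;   /* 32 bytes, matching %athing in the IR */

void copy_as_vector(athing *dst, const athing *src) {
    v8sf tmp;
    memcpy(&tmp, src, sizeof tmp);
    memcpy(dst, &tmp, sizeof tmp);
}
```

Going through the vector type is what the manual bitcast achieves in the .ll: it tells the backend the copy is one 32-byte value rather than a pile of scalar fields.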
> >> >> >>
> >> >> >> The rest of the file demonstrates that the code for modifying
> >> >> >> these
> >> >> >> vectors is pretty good, but has examples of bad ways to initialize
> >> >> >> the
> >> >> >> structure and a good way to initialize it. If I try to store a
> >> >> >> constant struct, I get 13 stores. If I try to assemble a vector by
> >> >> >> casting <2 x i16> to float then <4 x i8> to float and installing
> >> >> >> them
> >> >> >> into a single <8 x float>, I do get the desired single store, but
> >> >> >> I
> >> >> >> get very complicated constants that are loaded from memory. In
> >> >> >> contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8>
> >> >> >> as
> >> >> >> I go, then I get the desired initialization with no loads and just
> >> >> >> modifications of the single YMM register. (Even this last one,
> >> >> >> however, doesn't have the best assembly because the words and
> >> >> >> bytes
> >> >> >> are not inserted into the vector simultaneously, but instead
> >> >> >> individually.)
> >> >> >>
> >> >> >> I am kind of surprised that the obvious code didn't get optimized
> >> >> >> the way I expected, and even the tedious version of the
> >> >> >> initialization isn't optimal. I would like to know whether a
> >> >> >> transformation from one to the other is feasible in LLVM (I know
> >> >> >> anything is possible, but what is feasible in this situation?), or
> >> >> >> whether I should implement a transformation like this in my
> >> >> >> front-end and settle for the initialization that comes out.
> >> >> >>
> >> >> >> Thank you for your time,
> >> >> >>
> >> >> >> Jay
> >> >> >>
> >> >> >> --
> >> >> >> Jay McCarthy
> >> >> >> Associate Professor
> >> >> >> PLT @ CS @ UMass Lowell
> >> >> >> http://jeapostrophe.github.io
> >> >> >>
> >> >> >> "Wherefore, be not weary in well-doing,
> >> >> >> for ye are laying the foundation of a great work.
> >> >> >> And out of small things proceedeth that which is great."
> >> >> >> - D&C 64:33
> >> >> >> _______________________________________________
> >> >> >> LLVM Developers mailing list
> >> >> >> llvm-dev at lists.llvm.org
> >> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> >> >> >
> >> >> >
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >>
> >> >
> >> > --
> >> > Hal Finkel
> >> > Assistant Computational Scientist
> >> > Leadership Computing Facility
> >> > Argonne National Laboratory
> >>
> >>
> >>
> >
> >
>
>
>
>