[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX

Jay McCarthy via llvm-dev llvm-dev at lists.llvm.org
Wed Nov 4 07:53:30 PST 2015


Oh that's great. I'll just update and go from there. Thanks so much
and sorry for the noise.

Jay

On Wed, Nov 4, 2015 at 10:46 AM, Sanjay Patel <spatel at rotateright.com> wrote:
> Hi Jay -
>
> I see the slow, small accesses using an older clang [Apple LLVM version
> 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change that
> comes into play if you don't specify a particular CPU:
> http://llvm.org/viewvc/llvm-project?view=revision&revision=245950
>
> $ ./clang -O1 -mavx copy.c -S -o -
> ...
>     movslq    %edi, %rax
>     movq    _spr_dynamic at GOTPCREL(%rip), %rcx
>     movq    (%rcx), %rcx
>     shlq    $5, %rax
>     movslq    %esi, %rdx
>     movq    _spr_static at GOTPCREL(%rip), %rsi
>     movq    (%rsi), %rsi
>     shlq    $5, %rdx
>     vmovups    (%rsi,%rdx), %ymm0                  <--- 32-byte load
>     vmovups    %ymm0, (%rcx,%rax)                 <--- 32-byte store
>     popq    %rbp
>     vzeroupper
>     retq
>
>
>
> On Wed, Nov 4, 2015 at 8:11 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
>>
>> Thanks, Hal.
>>
>> That code is very readable. Basically, all of the following have to be true:
>> - not a memset or memzero [check]
>> - no implicit floats [check]
>> - size greater than 16 [check, it's 32]
>> - ! isUnalignedMem16Slow [check?]
>> - int256, fp256, or sse2, or sse1 is around [check]
>>
>> That last condition is:
>> - src & dst alignment is 0 or greater than 16
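Spelling that checklist out as straight C (my own paraphrase of how I read getOptimalMemOpType, so the names, the exact thresholds, and the folding into one predicate are all my assumptions, not the real LLVM source):

```c
#include <stdbool.h>

/* Sketch of the conditions above, folded into one predicate.
   An alignment of 0 is taken to mean "unknown / don't care". */
static bool may_use_wide_vector_copy(bool is_memset_or_memzero,
                                     bool implicit_float_disallowed,
                                     unsigned size_in_bytes,
                                     bool unaligned_mem16_slow,
                                     bool has_avx_or_sse,
                                     unsigned src_align,
                                     unsigned dst_align) {
  if (is_memset_or_memzero)      return false;  /* not a memset/memzero */
  if (implicit_float_disallowed) return false;  /* no implicit floats */
  if (size_in_bytes <= 16)       return false;  /* size greater than 16 */
  if (unaligned_mem16_slow)      return false;  /* !isUnalignedMem16Slow */
  if (!has_avx_or_sse)           return false;  /* int256/fp256/sse2/sse1 */
  /* src & dst alignment is 0 or at least 16 */
  bool src_ok = (src_align == 0) || (src_align >= 16);
  bool dst_ok = (dst_align == 0) || (dst_align >= 16);
  return src_ok && dst_ok;
}
```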
>>
>> I think this is true, because I'm reading from a giant array of these
>> things, so the memory should be aligned to the object size. Assuming
>> that's wrong, I added an explicit alignment attribute.
>>
>> I think part of the problem is that the memcpy that gets generated
>> isn't for the structure itself, but for the structs after they have
>> been bitcast into character arrays:
>>
>>   %17 = bitcast %struct.sprite* %9 to i8*
>>   %18 = bitcast %struct.sprite* %16 to i8*
>>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)
>>
>> So even though the original struct pointers were aligned at 32, the
>> byte arrays that are created lose that alignment information.
>>
>> If this is correct, would you recommend filing this as a bug with a
>> little test case?
>>
>> BTW, Here's a tiny C program that demonstrates the "problem":
>>
>> typedef struct {
>>   float dx; float dy;
>>   float mx; float my;
>>   float theta; float a;
>>   short spr; short pal;
>>   char layer;
>>   char r; char g; char b;
>> } sprite;
>>
>> sprite *spr_static;    // or an array of [1024], or add
>>                        // __attribute__((align_value(32)))
>> sprite *spr_dynamic;   // or an array of [1024], or add
>>                        // __attribute__((align_value(32)))
>>
>> void copy(int i, int j) {
>>   spr_dynamic[i] = spr_static[j];
>> }
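And here is the array-of-1024 variant I actually care about, with the 32-byte alignment made explicit via C11 _Alignas (this is my assumption about how to spell the "giant array" case; sizeof(sprite) is 32 with no padding, so every element of a 32-byte-aligned array stays 32-byte aligned):

```c
typedef struct {
  float dx; float dy;
  float mx; float my;
  float theta; float a;
  short spr; short pal;
  char layer;
  char r; char g; char b;
} sprite;

/* Explicit 32-byte alignment on the arrays themselves, so the
   compiler can prove each 32-byte element copy is 32-byte aligned. */
static _Alignas(32) sprite spr_static[1024];
static _Alignas(32) sprite spr_dynamic[1024];

void copy(int i, int j) {
  spr_dynamic[i] = spr_static[j];
}
```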
>>
>> Thanks!
>>
>> Jay
>>
>> On Tue, Nov 3, 2015 at 1:33 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>> >
>> >
>> > ----- Original Message -----
>> >> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
>> >> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
>> >> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
>> >> Sent: Tuesday, November 3, 2015 12:30:51 PM
>> >> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on
>> >> X86-64 AVX
>> >>
>> >> If the memcpy version isn't getting optimized into larger memory
>> >> operations, that definitely sounds like a bug worth filing.
>> >>
>> >> Lowering of memcpy is affected by the size of the copy, alignments of
>> >> the source and dest, and CPU target. You may be able to narrow down
>> >> the problem by changing those parameters.
>> >>
>> >
>> > The relevant target-specific logic is in
>> > X86TargetLowering::getOptimalMemOpType, looking at that might help in
>> > understanding what's going on.
>> >
>> >  -Hal
>> >
>> >>
>> >> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy
>> >> <jay.mccarthy at gmail.com> wrote:
>> >>
>> >>
>> >> Thank you for your reply. FWIW, I wrote the .ll by hand after taking
>> >> the C program, using clang to emit the llvm and seeing the memcpy.
>> >> The
>> >> memcpy version that clang generates gets compiled into assembly that
>> >> uses the large sequence of movs and does not use the vector hardware
>> >> at all. When I started debugging, I took the clang-produced .ll and
>> >> rewrote it in different ways, trying to get different results.
>> >>
>> >> Jay
>> >>
>> >>
>> >>
>> >> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel
>> >> <spatel at rotateright.com> wrote:
>> >> > Hi Jay -
>> >> >
>> >> > I'm surprised by the codegen for your examples too, but LLVM has an
>> >> > expectation that a front-end and IR optimizer will use llvm.memcpy
>> >> > liberally:
>> >> >
>> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
>> >> >
>> >> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
>> >> >
>> >> > "Any ld-ld-st-st sequence over this should have been converted to
>> >> > llvm.memcpy by the frontend."
>> >> > "The optimizer should really avoid this case by converting large
>> >> > object/array copies to llvm.memcpy"
>> >> >
>> >> >
>> >> > So for example with clang:
>> >> >
>> >> > $ cat copy.c
>> >> > struct bagobytes {
>> >> >   int i0;
>> >> >   int i1;
>> >> > };
>> >> >
>> >> > void foo(struct bagobytes* a, struct bagobytes* b) {
>> >> > *b = *a;
>> >> > }
>> >> >
>> >> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
>> >> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
>> >> > ...
>> >> > call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
>> >> > ret void
>> >> > }
>> >> >
>> >> > It may still be worth filing a bug (or seeing if one is already
>> >> > open) for
>> >> > one of your simple examples.
>> >> >
>> >> >
>> >> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
>> >> > <llvm-dev at lists.llvm.org> wrote:
>> >> >>
>> >> >> I am a first-time poster, so I apologize if this is an obvious
>> >> >> question or out of scope for LLVM. I am an LLVM user. I don't
>> >> >> really
>> >> >> know anything about hacking on LLVM, but I do know a bit about
>> >> >> compilation generally.
>> >> >>
>> >> >> I am on x86-64 and I am interested in structure reads, writes, and
>> >> >> constants being optimized to use vector registers when the
>> >> >> alignment
>> >> >> and sizes are right. I have created a gist of a small example:
>> >> >>
>> >> >> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
>> >> >>
>> >> >> The assembly is produced with
>> >> >>
>> >> >> llc -O3 -march=x86-64 -mcpu=corei7-avx
>> >> >>
>> >> >> The key idea is that we have a structure like this:
>> >> >>
>> >> >> %athing = type { float, float, float, float, float, float,
>> >> >>                  i16, i16, i8, i8, i8, i8 }
>> >> >>
>> >> >> That works out to be 32 bytes, so it can fit in YMM registers.
>> >> >>
>> >> >> If I have two pointers to arrays of these things:
>> >> >>
>> >> >> @one = external global %athing
>> >> >> @two = external global %athing
>> >> >>
>> >> >> and then I do a copy from one to the other
>> >> >>
>> >> >> %a = load %athing* @two
>> >> >> store %athing %a, %athing* @one
>> >> >>
>> >> >> Then the code that is generated uses the XMM registers for the
>> >> >> floats,
>> >> >> but does 12 loads and then 12 stores.
>> >> >>
>> >> >> In contrast, if I manually cast to a properly sized float vector I
>> >> >> get
>> >> >> the desired single load and single store:
>> >> >>
>> >> >> %two_vector = bitcast %athing* @two to <8 x float>*
>> >> >> %b = load <8 x float>* %two_vector
>> >> >> %one_vector = bitcast %athing* @one to <8 x float>*
>> >> >> store <8 x float> %b, <8 x float>* %one_vector
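The same trick can be written at the C level with the clang/GCC vector_size extension; this is my own sketch (the struct layout mirrors %athing, but the type names and the memcpy-through-a-vector idiom are my assumptions, not from the gist):

```c
#include <string.h>

/* View the 32-byte object as an <8 x float> vector. */
typedef float v8f __attribute__((vector_size(32)));

typedef struct {
  float f[6];
  short s[2];
  char c[4];
} athing;  /* 24 + 4 + 4 = 32 bytes, like %athing */

void copy_athing(athing *dst, const athing *src) {
  /* Going through memcpy rather than a pointer cast keeps this
     well-defined under strict aliasing; the compiler can still
     collapse it into one 32-byte load and one 32-byte store. */
  v8f tmp;
  memcpy(&tmp, src, sizeof tmp);
  memcpy(dst, &tmp, sizeof tmp);
}
```

With -O2 -mavx this should be eligible for a single vmovups pair, since the whole copy is a fixed-size 32-byte llvm.memcpy.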
>> >> >>
>> >> >> The rest of the file demonstrates that the code for modifying
>> >> >> these
>> >> >> vectors is pretty good, but has examples of bad ways to initialize
>> >> >> the
>> >> >> structure and a good way to initialize it. If I try to store a
>> >> >> constant struct, I get 13 stores. If I try to assemble a vector by
>> >> >> casting <2 x i16> to float then <4 x i8> to float and installing
>> >> >> them
>> >> >> into a single <8 x float>, I do get the desired single store, but
>> >> >> I
>> >> >> get very complicated constants that are loaded from memory. In
>> >> >> contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8>
>> >> >> as
>> >> >> I go, then I get the desired initialization with no loads and just
>> >> >> modifications of the single YMM register. (Even this last one,
>> >> >> however, doesn't have the best assembly because the words and
>> >> >> bytes
>> >> >> are not inserted into the vector simultaneously, but instead
>> >> >> individually.)
>> >> >>
>> >> >> I am kind of surprised that the obvious code didn't get optimized
>> >> >> the
>> >> >> way I expected and even the tedious version of the initialization
>> >> >> isn't optimal either. I would like to know if a transformation of
>> >> >> one
>> >> >> to the other is feasible in LLVM (I know anything is possible, but
>> >> >> what is feasible in this situation?) or if I should implement a
>> >> >> transformation like this in my front-end and settle for the
>> >> >> initialization that comes out.
>> >> >>
>> >> >> Thank you for your time,
>> >> >>
>> >> >> Jay
>> >> >>
>> >> >> --
>> >> >> Jay McCarthy
>> >> >> Associate Professor
>> >> >> PLT @ CS @ UMass Lowell
>> >> >> http://jeapostrophe.github.io
>> >> >>
>> >> >> "Wherefore, be not weary in well-doing,
>> >> >> for ye are laying the foundation of a great work.
>> >> >> And out of small things proceedeth that which is great."
>> >> >> - D&C 64:33
>> >> >> _______________________________________________
>> >> >> LLVM Developers mailing list
>> >> >> llvm-dev at lists.llvm.org
>> >> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> >> >
>> >> >
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >
>> > --
>> > Hal Finkel
>> > Assistant Computational Scientist
>> > Leadership Computing Facility
>> > Argonne National Laboratory
>>
>>
>>
>
>



-- 
Jay McCarthy
Associate Professor
PLT @ CS @ UMass Lowell
http://jeapostrophe.github.io

           "Wherefore, be not weary in well-doing,
      for ye are laying the foundation of a great work.
And out of small things proceedeth that which is great."
                          - D&C 64:33

