<div dir="ltr">No problem. Please do file bugs if you see anything that looks suspicious.<br><br>The x86 memcpy lowering still has that FIXME comment that I haven't gotten back around to, and we have at least one other potential improvement:<br><a href="https://llvm.org/bugs/show_bug.cgi?id=24678">https://llvm.org/bugs/show_bug.cgi?id=24678</a><br></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Nov 4, 2015 at 8:53 AM, Jay McCarthy <span dir="ltr"><<a href="mailto:jay.mccarthy@gmail.com" target="_blank">jay.mccarthy@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Oh that's great. I'll just update and go from there. Thanks so much<br>

and sorry for the noise.<br>

<span class="HOEnZb"><font color="#888888"><br>

Jay<br>

</font></span><div class="HOEnZb"><div class="h5"><br>

On Wed, Nov 4, 2015 at 10:46 AM, Sanjay Patel <<a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a>> wrote:<br>

> Hi Jay -<br>

><br>

> I see the slow, small accesses using an older clang [Apple LLVM version<br>

> 7.0.0 (clang-700.1.76)], but this looks fixed on trunk. I made a change that<br>

> comes into play if you don't specify a particular CPU:<br>

> <a href="http://llvm.org/viewvc/llvm-project?view=revision&revision=245950" rel="noreferrer" target="_blank">http://llvm.org/viewvc/llvm-project?view=revision&revision=245950</a><br>

><br>

> $ ./clang -O1 -mavx copy.c -S -o -<br>

> ...<br>

>     movslq    %edi, %rax<br>

>     movq    _spr_dynamic@GOTPCREL(%rip), %rcx<br>

>     movq    (%rcx), %rcx<br>

>     shlq    $5, %rax<br>

>     movslq    %esi, %rdx<br>

>     movq    _spr_static@GOTPCREL(%rip), %rsi<br>

>     movq    (%rsi), %rsi<br>

>     shlq    $5, %rdx<br>

>     vmovups    (%rsi,%rdx), %ymm0                  <--- 32-byte load<br>

>     vmovups    %ymm0, (%rcx,%rax)                 <--- 32-byte store<br>

>     popq    %rbp<br>

>     vzeroupper<br>

>     retq<br>

><br>

><br>

><br>

> On Wed, Nov 4, 2015 at 8:11 AM, Jay McCarthy <<a href="mailto:jay.mccarthy@gmail.com">jay.mccarthy@gmail.com</a>> wrote:<br>

>><br>

>> Thanks, Hal.<br>

>><br>

>> That code is very readable. Basically, the following has to be true<br>

>> - not a memset or memzero [check]<br>

>> - no implicit floats [check]<br>

>> - size greater than 16 [check, it's 32]<br>

>> - ! isUnalignedMem16Slow [check?]<br>

>> - int256, fp256, or sse2, or sse1 is around [check]<br>

>><br>

>> That last condition is:<br>

>> - src & dst alignment is 0 or greater than 16<br>

>><br>

>> I think this is true, because I'm reading from a giant array of these<br>

>> things, so the memory should be aligned to the object size. Assuming<br>

>> that's wrong, I added an explicit alignment attribute.<br>

>><br>

>> I think part of the problem is that the memcpy that gets generated<br>

>> isn't for the structure, but for the structures bitcast into character<br>

>> arrays:<br>

>><br>

>>   %17 = bitcast %struct.sprite* %9 to i8*<br>

>>   %18 = bitcast %struct.sprite* %16 to i8*<br>

>>   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %17, i8* %18, i64 32, i32 4, i1 false)<br>

>><br>

>> So even though the original struct pointers were aligned at 32, the<br>

>> byte arrays that are created lose that alignment information.<br>

>><br>

>> If this is correct, would you recommend this as just an error that<br>

>> will be fixed with a little test case?<br>

>><br>

>> BTW, Here's a tiny C program that demonstrates the "problem":<br>

>><br>

>> typedef struct {<br>

>>   float dx; float dy;<br>

>>   float mx; float my;<br>

>>   float theta; float a;<br>

>>   short spr; short pal;<br>

>>   char layer;<br>

>>   char r; char g; char b;<br>

>> } sprite;<br>

>><br>

>> sprite *spr_static;    // or array of [1024] // or add __attribute__ ((align_value(32)))<br>

>> sprite *spr_dynamic;   // or array of [1024] // or add __attribute__ ((align_value(32)))<br>

>><br>

>> void copy(int i, int j) {<br>

>>   spr_dynamic[i] = spr_static[j];<br>

>> }<br>

>><br>

>> Thanks!<br>

>><br>

>> Jay<br>

>><br>

>> On Tue, Nov 3, 2015 at 1:33 PM, Hal Finkel <<a href="mailto:hfinkel@anl.gov">hfinkel@anl.gov</a>> wrote:<br>

>> ><br>

>> ><br>

>> > ----- Original Message -----<br>

>> >> From: "Sanjay Patel via llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

>> >> To: "Jay McCarthy" <<a href="mailto:jay.mccarthy@gmail.com">jay.mccarthy@gmail.com</a>><br>

>> >> Cc: "llvm-dev" <<a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a>><br>

>> >> Sent: Tuesday, November 3, 2015 12:30:51 PM<br>

>> >> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX<br>

>> >><br>

>> >> If the memcpy version isn't getting optimized into larger memory<br>

>> >> operations, that definitely sounds like a bug worth filing.<br>

>> >><br>

>> >> Lowering of memcpy is affected by the size of the copy, alignments of<br>

>> >> the source and dest, and CPU target. You may be able to narrow down<br>

>> >> the problem by changing those parameters.<br>

>> >><br>

>> ><br>

>> > The relevant target-specific logic is in<br>

>> > X86TargetLowering::getOptimalMemOpType, looking at that might help in<br>

>> > understanding what's going on.<br>

>> ><br>

>> >  -Hal<br>

>> ><br>

>> >><br>

>> >> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <<br>

>> >> <a href="mailto:jay.mccarthy@gmail.com">jay.mccarthy@gmail.com</a> > wrote:<br>

>> >><br>

>> >><br>

>> >> Thank you for your reply. FWIW, I wrote the .ll by hand after taking<br>

>> >> the C program, using clang to emit the LLVM IR, and seeing the memcpy.<br>

>> >> The memcpy version that clang generates gets compiled into assembly<br>

>> >> that uses the large sequence of movs and does not use the vector<br>

>> >> hardware at all. When I started debugging, I took that clang-produced<br>

>> >> .ll and started to write it different ways trying to get different<br>

>> >> results.<br>

>> >><br>

>> >> Jay<br>

>> >><br>

>> >><br>

>> >><br>

>> >> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <<br>

>> >> <a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a> > wrote:<br>

>> >> > Hi Jay -<br>

>> >> ><br>

>> >> > I'm surprised by the codegen for your examples too, but LLVM has an<br>

>> >> > expectation that a front-end and IR optimizer will use llvm.memcpy<br>

>> >> > liberally:<br>

>> >> ><br>

>> >> > <a href="http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094" rel="noreferrer" target="_blank">http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094</a><br>

>> >> ><br>

>> >> > <a href="http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156" rel="noreferrer" target="_blank">http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156</a><br>

>> >> ><br>

>> >> > "Any ld-ld-st-st sequence over this should have been converted to<br>

>> >> > llvm.memcpy by the frontend."<br>

>> >> > "The optimizer should really avoid this case by converting large<br>

>> >> > object/array copies to llvm.memcpy"<br>

>> >> ><br>

>> >> ><br>

>> >> > So for example with clang:<br>

>> >> ><br>

>> >> > $ cat copy.c<br>

>> >> > struct bagobytes {<br>

>> >> > int i0;<br>

>> >> > int i1;<br>

>> >> > };<br>

>> >> ><br>

>> >> > void foo(struct bagobytes* a, struct bagobytes* b) {<br>

>> >> > *b = *a;<br>

>> >> > }<br>

>> >> ><br>

>> >> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -<br>

>> >> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {<br>

>> >> > ...<br>

>> >> > call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6<br>

>> >> > ret void<br>

>> >> > }<br>

>> >> ><br>

>> >> > It may still be worth filing a bug (or seeing if one is already open)<br>

>> >> > for one of your simple examples.<br>

>> >> ><br>

>> >> ><br>

>> >> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev<br>

>> >> > < <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a> > wrote:<br>

>> >> >><br>

>> >> >> I am a first-time poster, so I apologize if this is an obvious<br>

>> >> >> question or out of scope for LLVM. I am an LLVM user. I don't really<br>

>> >> >> know anything about hacking on LLVM, but I do know a bit about<br>

>> >> >> compilation generally.<br>

>> >> >><br>

>> >> >> I am on x86-64 and I am interested in structure reads, writes, and<br>

>> >> >> constants being optimized to use vector registers when the alignment<br>

>> >> >> and sizes are right. I have created a gist of a small example:<br>

>> >> >><br>

>> >> >> <a href="https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed" rel="noreferrer" target="_blank">https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed</a><br>

>> >> >><br>

>> >> >> The assembly is produced with<br>

>> >> >><br>

>> >> >> llc -O3 -march=x86-64 -mcpu=corei7-avx<br>

>> >> >><br>

>> >> >> The key idea is that we have a structure like this:<br>

>> >> >><br>

>> >> >> %athing = type { float, float, float, float, float, float, i16, i16, i8, i8, i8, i8 }<br>

>> >> >><br>

>> >> >> That works out to be 32 bytes, so it can fit in YMM registers.<br>

>> >> >><br>

>> >> >> If I have two pointers to arrays of these things:<br>

>> >> >><br>

>> >> >> @one = external global %athing<br>

>> >> >> @two = external global %athing<br>

>> >> >><br>

>> >> >> and then I do a copy from one to the other<br>

>> >> >><br>

>> >> >> %a = load %athing* @two<br>

>> >> >> store %athing %a, %athing* @one<br>

>> >> >><br>

>> >> >> Then the code that is generated uses the XMM registers for the floats,<br>

>> >> >> but does 12 loads and then 12 stores.<br>

>> >> >><br>

>> >> >> In contrast, if I manually cast to a properly sized float vector, I get<br>

>> >> >> the desired single load and single store:<br>

>> >> >><br>

>> >> >> %two_vector = bitcast %athing* @two to <8 x float>*<br>

>> >> >> %b = load <8 x float>* %two_vector<br>

>> >> >> %one_vector = bitcast %athing* @one to <8 x float>*<br>

>> >> >> store <8 x float> %b, <8 x float>* %one_vector<br>

>> >> >><br>

>> >> >> The rest of the file demonstrates that the code for modifying these<br>

>> >> >> vectors is pretty good, but has examples of bad ways to initialize the<br>

>> >> >> structure and a good way to initialize it. If I try to store a<br>

>> >> >> constant struct, I get 13 stores. If I try to assemble a vector by<br>

>> >> >> casting <2 x i16> to float then <4 x i8> to float and installing them<br>

>> >> >> into a single <8 x float>, I do get the desired single store, but I<br>

>> >> >> get very complicated constants that are loaded from memory. In<br>

>> >> >> contrast, if I bitcast the <8 x float> to <16 x i16> and <32 x i8> as<br>

>> >> >> I go, then I get the desired initialization with no loads and just<br>

>> >> >> modifications of the single YMM register. (Even this last one,<br>

>> >> >> however, doesn't have the best assembly because the words and bytes<br>

>> >> >> are not inserted into the vector simultaneously, but instead<br>

>> >> >> individually.)<br>

>> >> >><br>

>> >> >> I am kind of surprised that the obvious code didn't get optimized the<br>

>> >> >> way I expected and even the tedious version of the initialization<br>

>> >> >> isn't optimal either. I would like to know if a transformation of one<br>

>> >> >> to the other is feasible in LLVM (I know anything is possible, but<br>

>> >> >> what is feasible in this situation?) or if I should implement a<br>

>> >> >> transformation like this in my front-end and settle for the<br>

>> >> >> initialization that comes out.<br>

>> >> >><br>

>> >> >> Thank you for your time,<br>

>> >> >><br>

>> >> >> Jay<br>

>> >> >><br>

>> >> >> --<br>

>> >> >> Jay McCarthy<br>

>> >> >> Associate Professor<br>

>> >> >> PLT @ CS @ UMass Lowell<br>

>> >> >> <a href="http://jeapostrophe.github.io" rel="noreferrer" target="_blank">http://jeapostrophe.github.io</a><br>

>> >> >><br>

>> >> >> "Wherefore, be not weary in well-doing,<br>

>> >> >> for ye are laying the foundation of a great work.<br>

>> >> >> And out of small things proceedeth that which is great."<br>

>> >> >> - D&C 64:33<br>

>> >> >> _______________________________________________<br>

>> >> >> LLVM Developers mailing list<br>

>> >> >> <a href="mailto:llvm-dev@lists.llvm.org">llvm-dev@lists.llvm.org</a><br>

>> >> >> <a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>

>> >> ><br>

>> >> ><br>

>> >><br>

>> >><br>

>> >><br>


>> >><br>

>> ><br>

>> > --<br>

>> > Hal Finkel<br>

>> > Assistant Computational Scientist<br>

>> > Leadership Computing Facility<br>

>> > Argonne National Laboratory<br>

>><br>

>><br>

>><br>


><br>

><br>

<br>

<br>

<br>


</div></div></blockquote></div><br></div>
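For anyone who wants to reproduce the alignment point from the thread, here is a minimal C sketch. It assumes a C11 compiler and the usual x86-64 ABI (4-byte float, 2-byte short), under which the sprite struct comes out at exactly 32 bytes. Declaring the data as arrays with alignas(32) rather than bare pointers is only an illustrative choice, not something the thread settled on, and SPRITE_COUNT and is_aligned32 are hypothetical names introduced here. Whether the copy actually turns into one 32-byte load and one 32-byte store still depends on the compiler version and flags.

```c
#include <assert.h>    /* static_assert (C11), assert */
#include <stdalign.h>  /* alignas */
#include <stdint.h>    /* uintptr_t */

/* Same layout as the sprite struct quoted in the thread. */
typedef struct {
  float dx; float dy;
  float mx; float my;
  float theta; float a;
  short spr; short pal;
  char layer;
  char r; char g; char b;
} sprite;

/* Assumption: 6 floats (24) + 2 shorts (4) + 4 chars (4) = 32 bytes. */
static_assert(sizeof(sprite) == 32, "sprite is expected to be 32 bytes");

enum { SPRITE_COUNT = 1024 };  /* hypothetical array length */

/* Aligning the arrays themselves makes every element start on a
   32-byte boundary, since each element is exactly 32 bytes. */
alignas(32) sprite spr_static[SPRITE_COUNT];
alignas(32) sprite spr_dynamic[SPRITE_COUNT];

/* Hypothetical helper: nonzero if p sits on a 32-byte boundary. */
int is_aligned32(const void *p) {
  return ((uintptr_t)p & 31u) == 0;
}

/* The struct assignment from the thread; the front end is expected to
   express this as an llvm.memcpy of 32 bytes. */
void copy(int i, int j) {
  spr_dynamic[i] = spr_static[j];
}
```

Compiling this with something like `clang -O1 -mavx -S` and reading the emitted assembly is the way to check which loads and stores a given compiler version actually produces.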