[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX

Hal Finkel via llvm-dev llvm-dev at lists.llvm.org
Tue Nov 3 10:33:06 PST 2015



----- Original Message -----
> From: "Sanjay Patel via llvm-dev" <llvm-dev at lists.llvm.org>
> To: "Jay McCarthy" <jay.mccarthy at gmail.com>
> Cc: "llvm-dev" <llvm-dev at lists.llvm.org>
> Sent: Tuesday, November 3, 2015 12:30:51 PM
> Subject: Re: [llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX
> 
> If the memcpy version isn't getting optimized into larger memory
> operations, that definitely sounds like a bug worth filing.
> 
> Lowering of memcpy is affected by the size of the copy, alignments of
> the source and dest, and CPU target. You may be able to narrow down
> the problem by changing those parameters.
> 

The relevant target-specific logic is in X86TargetLowering::getOptimalMemOpType, looking at that might help in understanding what's going on.
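For anyone who wants to poke at that logic directly, a bare llvm.memcpy can be fed straight to llc while varying -mcpu and the size/alignment operands (the function name and constants below are illustrative, not taken from the gist):

```llvm
; Sketch: inspect the lowering with
;   llc -O3 -mcpu=corei7-avx memcpy32.ll -o -
; Varying the i64 size and i32 alignment arguments shows how
; getOptimalMemOpType picks the width of the memory operations.
declare void @llvm.memcpy.p0i8.p0i8.i64(i8*, i8*, i64, i32, i1)

define void @copy32(i8* %dst, i8* %src) {
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %dst, i8* %src, i64 32, i32 16, i1 false)
  ret void
}
```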

 -Hal

> 
> On Tue, Nov 3, 2015 at 11:01 AM, Jay McCarthy <jay.mccarthy at gmail.com> wrote:
> 
> 
> Thank you for your reply. FWIW, I wrote the .ll by hand: I took the C
> program, used clang to emit the LLVM IR, and saw the memcpy. The
> memcpy version that clang generates gets compiled into assembly that
> uses a large sequence of movs and does not use the vector hardware at
> all. When I started debugging, I took that clang-produced .ll and
> rewrote it in different ways, trying to get different results.
> 
> Jay
> 
> 
> 
> On Tue, Nov 3, 2015 at 12:23 PM, Sanjay Patel <spatel at rotateright.com> wrote:
> > Hi Jay -
> > 
> > I'm surprised by the codegen for your examples too, but LLVM has an
> > expectation that a front-end and IR optimizer will use llvm.memcpy
> > liberally:
> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l00094
> > http://llvm.org/docs/doxygen/html/SelectionDAGBuilder_8cpp_source.html#l03156
> > 
> > "Any ld-ld-st-st sequence over this should have been converted to
> > llvm.memcpy by the frontend."
> > "The optimizer should really avoid this case by converting large
> > object/array copies to llvm.memcpy"
> > 
> > 
> > So for example with clang:
> > 
> > $ cat copy.c
> > struct bagobytes {
> >   int i0;
> >   int i1;
> > };
> > 
> > void foo(struct bagobytes* a, struct bagobytes* b) {
> >   *b = *a;
> > }
> > 
> > $ clang -O2 copy.c -S -emit-llvm -Xclang -disable-llvm-optzns -o -
> > define void @foo(%struct.bagobytes* %a, %struct.bagobytes* %b) #0 {
> > ...
> >   call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 8, i32 4, i1 false), !tbaa.struct !6
> >   ret void
> > }
> > 
> > It may still be worth filing a bug (or seeing if one is already
> > open) for one of your simple examples.
> > 
> > 
> > On Thu, Oct 29, 2015 at 6:08 PM, Jay McCarthy via llvm-dev
> > <llvm-dev at lists.llvm.org> wrote:
> >> 
> >> I am a first-time poster, so I apologize if this is an obvious
> >> question or out of scope for LLVM. I am an LLVM user; I don't really
> >> know anything about hacking on LLVM, but I do know a bit about
> >> compilation generally.
> >> 
> >> I am on x86-64, and I am interested in structure reads, writes, and
> >> constants being optimized to use vector registers when the alignment
> >> and sizes are right. I have created a gist of a small example:
> >> 
> >> https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed
> >> 
> >> The assembly is produced with
> >> 
> >> llc -O3 -march=x86-64 -mcpu=corei7-avx
> >> 
> >> The key idea is that we have a structure like this:
> >> 
> >> %athing = type { float, float, float, float, float, float,
> >>                  i16, i16, i8, i8, i8, i8 }
> >> 
> >> That works out to 32 bytes, so it fits in a YMM register.
> >> 
> >> If I have two globals of this type:
> >> 
> >> @one = external global %athing
> >> @two = external global %athing
> >> 
> >> and then I do a copy from one to the other
> >> 
> >> %a = load %athing* @two
> >> store %athing %a, %athing* @one
> >> 
> >> Then the code that is generated uses the XMM registers for the
> >> floats, but does 12 loads and then 12 stores.
> >> 
> >> In contrast, if I manually bitcast to a properly sized float
> >> vector, I get the desired single load and single store:
> >> 
> >> %two_vector = bitcast %athing* @two to <8 x float>*
> >> %b = load <8 x float>* %two_vector
> >> %one_vector = bitcast %athing* @one to <8 x float>*
> >> store <8 x float> %b, <8 x float>* %one_vector
> >> 
> >> The rest of the file demonstrates that the code for modifying these
> >> vectors is pretty good, but it has examples of bad ways to
> >> initialize the structure and a good way to initialize it. If I try
> >> to store a constant struct, I get 13 stores. If I try to assemble a
> >> vector by casting <2 x i16> to float and then <4 x i8> to float and
> >> installing them into a single <8 x float>, I do get the desired
> >> single store, but I get very complicated constants that are loaded
> >> from memory. In contrast, if I bitcast the <8 x float> to <16 x i16>
> >> and <32 x i8> as I go, then I get the desired initialization with no
> >> loads and just modifications of the single YMM register. (Even this
> >> last one, however, doesn't have the best assembly, because the words
> >> and bytes are inserted into the vector individually rather than
> >> simultaneously.)
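For concreteness, the bitcast-as-you-go initialization described above looks roughly like this (element values and indices are illustrative, not the gist's actual constants):

```llvm
; Sketch of initializing the struct inside one <8 x float> register by
; reinterpreting it at each element width.
define <8 x float> @init() {
  ; install a float field directly
  %f  = insertelement <8 x float> undef, float 1.0, i32 0
  ; reinterpret as words to install an i16 field
  %w0 = bitcast <8 x float> %f to <16 x i16>
  %w1 = insertelement <16 x i16> %w0, i16 7, i32 12
  ; reinterpret as bytes to install an i8 field
  %b0 = bitcast <16 x i16> %w1 to <32 x i8>
  %b1 = insertelement <32 x i8> %b0, i8 9, i32 28
  %r  = bitcast <32 x i8> %b1 to <8 x float>
  ret <8 x float> %r
}
```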
> >> 
> >> I am kind of surprised that the obvious code didn't get optimized
> >> the way I expected, and even the tedious version of the
> >> initialization isn't optimal. I would like to know whether a
> >> transformation of one to the other is feasible in LLVM (I know
> >> anything is possible, but what is feasible in this situation?), or
> >> whether I should implement a transformation like this in my
> >> front-end and settle for the initialization that comes out.
> >> 
> >> Thank you for your time,
> >> 
> >> Jay
> >> 
> >> --
> >> Jay McCarthy
> >> Associate Professor
> >> PLT @ CS @ UMass Lowell
> >> http://jeapostrophe.github.io
> >> 
> >> "Wherefore, be not weary in well-doing,
> >> for ye are laying the foundation of a great work.
> >> And out of small things proceedeth that which is great."
> >> - D&C 64:33
> >> _______________________________________________
> >> LLVM Developers mailing list
> >> llvm-dev at lists.llvm.org
> >> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> > 
> > 
> 
> 
> 
> 
> 
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory

