[llvm-dev] Vectorizing structure reads, writes, etc on X86-64 AVX

Thu Oct 29 17:08:08 PDT 2015

I am a first-time poster, so I apologize if this is an obvious
question or out of scope for LLVM. I am an LLVM user. I don't really
know anything about hacking on LLVM itself, but I do know a bit about
compilation in general.

I am on x86-64 and I am interested in structure reads, writes, and
constants being optimized to use vector registers when the alignment
and sizes are right. I have created a gist of a small example:

https://gist.github.com/jeapostrophe/d54d3a6a871e5127a6ed

The assembly is produced with

llc -O3 -march=x86-64 -mcpu=corei7-avx

The key idea is that we have a structure like this:

%athing = type { float, float, float, float, float, float, i16, i16,
i8, i8, i8, i8 }

That works out to 32 bytes (6*4 for the floats, 2*2 for the i16s, and
4*1 for the i8s), so it fits exactly in one 256-bit YMM register.
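
The 32-byte figure can be checked with a small C sketch. This is not
part of the gist: the struct below is a hypothetical C mirror of
%athing, and the field names (f, w, b) and helper functions are made
up for illustration. It assumes a typical x86-64 ABI where float is 4
bytes and 4-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C mirror of %athing; field names are illustrative only. */
struct athing {
    float   f[6];   /* 6 x float = 24 bytes, offsets 0..23  */
    int16_t w[2];   /* 2 x i16   =  4 bytes, offsets 24..27 */
    int8_t  b[4];   /* 4 x i8    =  4 bytes, offsets 28..31 */
};

/* Compile-time checks: no padding, 32 bytes total. */
static_assert(sizeof(struct athing) == 32, "athing should be 32 bytes");
static_assert(offsetof(struct athing, w) == 24, "i16 fields start at 24");
static_assert(offsetof(struct athing, b) == 28, "i8 fields start at 28");

/* Small accessors so the layout can also be inspected at run time. */
size_t athing_size(void)  { return sizeof(struct athing); }
size_t athing_align(void) { return _Alignof(struct athing); }
```

Note that under these assumptions the natural alignment of the struct
is only 4 bytes, so a 32-byte-aligned vector access would need the
object to be explicitly over-aligned.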

If I have two globals of this type:

@one = external global %athing
@two = external global %athing

and then I do a copy from one to the other

  %a = load %athing* @two
  store %athing %a, %athing* @one

Then the generated code uses the XMM registers for the floats, but it
performs 12 loads followed by 12 stores, one per field.

In contrast, if I manually cast to a properly sized float vector I get
the desired single load and single store:

  %two_vector = bitcast %athing* @two to <8 x float>*
  %b = load <8 x float>* %two_vector
  %one_vector = bitcast %athing* @one to <8 x float>*
  store <8 x float> %b, <8 x float>* %one_vector
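
For comparison, here is a minimal C sketch of the same idea as it
might look in a front-end: the aggregate is copied as one opaque
32-byte block instead of field by field. The struct is a hypothetical
mirror of %athing, and copy_athing and athing_same_bytes are made-up
helper names. Whether a given compiler lowers the memcpy to a single
YMM load and store depends on the target flags and compiler version;
the sketch only shows that the block copy preserves every field.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C mirror of %athing; field names are illustrative only. */
struct athing {
    float   f[6];
    int16_t w[2];
    int8_t  b[4];
};

/* Copy the whole aggregate as one 32-byte block. */
void copy_athing(struct athing *dst, const struct athing *src) {
    memcpy(dst, src, sizeof *dst);
}

/* Byte-wise comparison; valid here because the struct has no padding. */
int athing_same_bytes(const struct athing *a, const struct athing *b) {
    return memcmp(a, b, sizeof *a) == 0;
}
```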

The rest of the file shows that the code generated for modifying these
vectors is pretty good, and it contains two bad ways and one good way
to initialize the structure. If I store a constant struct, I get 13
stores. If I assemble the vector by bitcasting a <2 x i16> to float
and a <4 x i8> to float and inserting both into a single <8 x float>,
I do get the desired single store, but the constants are very
complicated and are loaded from memory. In contrast, if I bitcast the
<8 x float> to <16 x i16> and <32 x i8> as I go, I get the desired
initialization with no loads, just modifications of the single YMM
register. (Even this last version does not produce the best assembly,
because the words and bytes are inserted into the vector one at a
time rather than all at once.)
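
To make the "all at once" idea concrete, here is a minimal C sketch
that packs the two i16 values into one 32-bit lane and the four i8
values into another, so the 32-byte image is built from eight 32-bit
lanes rather than from twelve separate fields. The struct is a
hypothetical mirror of %athing, the helper names are made up, and the
byte order assumes a little-endian target such as x86-64. It says
nothing about what instructions a compiler would actually emit.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C mirror of %athing; field names are illustrative only. */
struct athing {
    float   f[6];
    int16_t w[2];
    int8_t  b[4];
};

/* Pack two 16-bit values into one 32-bit lane (little-endian order). */
uint32_t pack_words(int16_t w0, int16_t w1) {
    return (uint32_t)(uint16_t)w0 | ((uint32_t)(uint16_t)w1 << 16);
}

/* Pack four 8-bit values into one 32-bit lane (little-endian order). */
uint32_t pack_bytes(int8_t b0, int8_t b1, int8_t b2, int8_t b3) {
    return (uint32_t)(uint8_t)b0
         | ((uint32_t)(uint8_t)b1 << 8)
         | ((uint32_t)(uint8_t)b2 << 16)
         | ((uint32_t)(uint8_t)b3 << 24);
}

/* Build the 32-byte image: six float lanes plus the two packed lanes. */
void build_image(unsigned char out[32], const float f[6],
                 uint32_t words, uint32_t bytes) {
    memcpy(out, f, 24);            /* lanes 0..5: the floats      */
    memcpy(out + 24, &words, 4);   /* lane 6: both i16 values     */
    memcpy(out + 28, &bytes, 4);   /* lane 7: all four i8 values  */
}
```

On a little-endian machine the resulting image is byte-for-byte the
same as a struct initialized field by field.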

I am somewhat surprised that the obvious code is not optimized the way
I expected, and that even the tedious version of the initialization is
not optimal. I would like to know whether transforming one into the
other is feasible in LLVM (I know anything is possible, but what is
feasible in this situation?), or whether I should implement a
transformation like this in my front-end and settle for the
initialization that comes out.

Thank you for your time,

Jay

-- 
Jay McCarthy
Associate Professor
PLT @ CS @ UMass Lowell
http://jeapostrophe.github.io

           "Wherefore, be not weary in well-doing,
      for ye are laying the foundation of a great work.
And out of small things proceedeth that which is great."
                          - D&C 64:33