[cfe-commits] [PATCH] Optimize vec3 loads/stores

Hal Finkel hfinkel at anl.gov
Fri Jul 27 11:08:17 PDT 2012


On Fri, 27 Jul 2012 10:42:07 -0700
Tanya Lattner <lattner at apple.com> wrote:

> 
> On Jul 27, 2012, at 2:41 AM, Hal Finkel wrote:
> 
> > On Mon, 23 Jul 2012 13:14:07 -0700
> > Tanya Lattner <lattner at apple.com> wrote:
> > 
> >> 
> >> On Jul 18, 2012, at 6:51 PM, John McCall wrote:
> >> 
> >>> On Jul 18, 2012, at 5:37 PM, Tanya Lattner wrote:
> >>>> On Jul 18, 2012, at 5:08 AM, Benyei, Guy wrote:
> >>>>> Hi Tanya,
> >>>>> Looks good and useful, but I'm not sure whether it should be
> >>>>> Clang's decision that storing and loading vec4s is better than
> >>>>> vec3.
> >>>> 
> >>>> The idea was to have Clang generate code that the optimizers
> >>>> would be more likely to do something useful and smart with. I
> >>>> understand the concern, but then I'm not sure where the best
> >>>> place for this would be.
> >>> 
> >>> Hmm.  The IR size of a <3 x blah> is basically the size of a
> >>> <4 x blah> anyway; arguably the backend already has all the
> >>> information it needs for this.  Dan, what do you think?
> >>> 
> >>> One objection to doing this in the frontend is that it's not clear
> >>> to me that this is a transformation we should be doing if
> >>> <4 x blah> isn't actually legal for the target.  But I'm amenable
> >>> to the idea that this belongs here.
> >>> 
> >> 
> >> I do not think it's Clang's job to care about this, as we already
> >> have this problem for other vector sizes and it's target lowering's
> >> job to fix it. 
> >> 
> >>> I'm also a little uncomfortable with this patch because it's so
> >>> special-cased to 3.  I understand that that might be all that
> >>> OpenCL really cares about, but it seems silly to add this code
> >>> that doesn't also kick in for, say, <7 x i16> or whatever.  It
> >>> really shouldn't be difficult to generalize.
> >> 
> >> While it could be generalized, I am only 100% confident in the
> >> codegen for vec3, as I know for sure that it improves the code
> >> quality that is ultimately generated. This is also thoroughly
> >> tested by our OpenCL compiler, so I am confident we are not
> >> breaking anything and we are improving performance. 
> > 
> > On the request of several people, I recently enhanced the BB
> > vectorizer to produce odd-sized vector types (when possible). This
> > was for two reasons:
> > 
> > 1. Some targets actually have instructions for length-3 vectors
> > (mostly for doing things on x,y,z triples), and they wanted
> > autovectorization support for these.
> > 
> > 2. This is generally a win elsewhere as well, because the
> > odd-length vectors will be promoted to even-length vectors (which
> > is good compared to leaving the code scalar).
> > 
> > In this context, I am curious to know how generating length-4
> > vectors in the frontend gives better performance. Is this something
> > that the BB vectorizer (or any other vectorizer) should be doing as
> > well?
> > 
> 
> While I think you are doing great work with your vectorizer, it's
> not something we can rely on yet to recover this performance, since
> it's not on by default.

Thanks! I did not mean to imply that the vectorizer was, or should be,
some kind of substitute for this optimization. I wanted to know whether
I should be enhancing the vectorizer to do this same kind of "rounding
up" of the vectors it produces.

> 
> A couple more comments about my patch. First, vec3 is an OpenCL
> type, and the spec defines it to have the same size and alignment as
> vec4. So generating this code pattern for loads/stores makes sense
> in that context. I could enable this only for OpenCL, but I think
> anyone who uses vec3 would want this performance win.

Interesting, thanks! I was just wondering why there is a performance
win at all. I assume that vectors of length 3 are currently promoted
to length-4 vectors during legalization. Why should declaring them to
be length 4 instead of length 3 affect the code quality at all?

> 
> Secondly, I'm not arguing that this is the ultimate fix, but it is
> valid, simple, and easy to remove once other areas of the compiler
> are improved to handle this case. 
> 
> After discussing with others, the proper fix might be something like
> this: 1. Enhance target data to encode the "native" vector types for
> the target, along the same lines as the "native" integer types it
> already supports.

In the medium term, we need this and a lot more (costs for shuffles,
aligned vs. unaligned loads, etc.). We should certainly have a
discussion about how to best do this (probably in a different thread).

> 2. Enhance the IR optimizer to force vectors to
> those types when it can get away with it.
> 
> This optimization needs to be early enough that other mid-level
> optimizations can take advantage of it.
> 
> I'd still like to check this in, as no alternative solution exists
> right now. Because there is some debate, I think the code owner
> needs to have the final say here (cc-ing Doug).

Please don't count me as objecting. We might want, however, a flag to
turn this off... it may be that at some point in the future the
LLVM-level logic will be sufficient to render this unnecessary, and
we'll probably want an easy way to run performance tests to find out
when that happens.

Thanks again,
Hal

> 
> Thanks,
> Tanya
> 
> 
> 
> > Thanks again,
> > Hal
> > 
> >> From my perspective, there is no vec7 in
> >> OpenCL, so there is no strong need for this to be generalized. It
> >> seems plausible that it would also improve code quality for vec7,
> >> but while I could make the change, I don't have real-world apps
> >> to test it. I think this can easily be left as an extension for
> >> later.
> >> 
> >>> 
> >>> Also, this optimization assumes that sizeof(element) is a power of
> >>> 2, because the only thing that the AST guarantees is that the size
> >>> of the vector type is a power of 2, and nextPow2(3 x E) ==
> >>> nextPow2(4 x E) is only true in general if E is a power of 2.  Is
> >>> that reasonable?  Is that checked somewhere?
> >>> 
> >> 
> >> I don't think this is officially checked anywhere, but it is a
> >> general rule.
> >> 
> >> -Tanya
> >> 
> >> 
> >>> Two minor notes.
> >>> 
> >>> +    const llvm::VectorType *VTy =
> >>> +      dyn_cast<llvm::VectorType>(EltTy);
> >>> 
> >>> How can this fail?  If it can, that's interesting;  if it can't,
> >>> it would be better if the "is this a vector operation" checks
> >>> used the IR type instead of the AST type.
> >>> 
> >>> +      llvm::PointerType *DstPtr =
> >>> cast<llvm::PointerType>(Addr->getType());
> >>> +      if (DstPtr->getElementType() != SrcTy) {
> >>> 
> >>> This can't fail in the case we care about, so you can just merge
> >>> this into the if-block above.
> >>> 
> >>> John.
> >> 



-- 
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory


