[cfe-commits] [PATCH] Optimize vec3 loads/stores

Tanya Lattner lattner at apple.com
Fri Jul 27 10:42:07 PDT 2012


On Jul 27, 2012, at 2:41 AM, Hal Finkel wrote:

> On Mon, 23 Jul 2012 13:14:07 -0700
> Tanya Lattner <lattner at apple.com> wrote:
> 
>> 
>> On Jul 18, 2012, at 6:51 PM, John McCall wrote:
>> 
>>> On Jul 18, 2012, at 5:37 PM, Tanya Lattner wrote:
>>>> On Jul 18, 2012, at 5:08 AM, Benyei, Guy wrote:
>>>>> Hi Tanya,
>>>>> Looks good and useful, but I'm not sure it should be Clang's
>>>>> decision whether storing and loading vec4s is better than vec3.
>>>> 
>>>> The idea was to have Clang generate code that the optimizers would
>>>> be more likely to do something useful and smart with. I understand
>>>> the concern, but I'm not sure where the best place for this would
>>>> be then?
>>> 
>>> Hmm.  The IR size of a <3 x blah> is basically the size of a <4 x
>>> blah> anyway;  arguably the backend already has all the information
>>> it needs for this.  Dan, what do you think?
>>> 
>>> One objection to doing this in the frontend is that it's not clear
>>> to me that this is a transformation we should be doing if <4 x
>>> blah> isn't actually legal for the target.  But I'm amenable to the
>>> idea that this belongs here.
>>> 
>> 
>> I do not think it's Clang's job to care about this, as we already
>> have this problem for other vector sizes and it's target lowering's
>> job to fix it.
>> 
>>> I'm also a little uncomfortable with this patch because it's so
>>> special-cased to 3.  I understand that that might be all that
>>> OpenCL really cares about, but it seems silly to add this code that
>>> doesn't also kick in for, say, <7 x i16> or whatever.  It really
>>> shouldn't be difficult to generalize.
>> 
>> While it could be generalized, I am only 100% confident in the
>> codegen for vec3, as I know for sure that it improves the code
>> quality that is ultimately generated. This is also thoroughly tested
>> by our OpenCL compiler, so I am confident we are not breaking
>> anything and we are improving performance.
> 
> On the request of several people, I recently enhanced the BB vectorizer
> to produce odd-sized vector types (when possible). This was for two
> reasons:
> 
> 1. Some targets actually have instructions for length-3 vectors
> (mostly for doing things on x,y,z triples), and they wanted
> autovectorization support for these.
> 
> 2. This is generally a win elsewhere as well because the odd-length
> vectors will be promoted to even-length vectors (which is good
> compared to leaving scalar code).
> 
> In this context, I am curious to know how generating length-4 vectors
> in the frontend gives better performance. Is this something that the BB
> vectorizer (or any other vectorizer) should be doing as well?
> 

While I think you are doing great work with your vectorizer, it's not something we can rely on yet to get back this performance, since it's not on by default.

A couple more comments about my patch. First, vec3 is an OpenCL type defined in the spec, which says it should have the same size and alignment as vec4, so generating this code pattern for loads/stores makes sense in that context. I could enable this only for OpenCL, but I think anyone who uses vec3 would want this performance win.
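To make the pattern concrete, here is a rough sketch of the load side. This is illustrative only, not the patch itself; it is written against a newer IRBuilder API than the patch uses, and the helper name emitVec3Load is made up:

#include "llvm/IR/Constants.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/IRBuilder.h"

// Sketch: load a vec3 by loading the whole 16-byte <4 x float> slot and
// shuffling away the padding lane.  A vec3 store is the mirror image:
// widen the <3 x float> value with an undef fourth lane, then store the
// full <4 x float>.
static llvm::Value *emitVec3Load(llvm::IRBuilder<> &B, llvm::Value *Addr) {
  llvm::Type *Vec4Ty = llvm::FixedVectorType::get(B.getFloatTy(), 4);
  llvm::Value *Wide = B.CreateLoad(Vec4Ty, Addr, "vec4.load");
  int Mask[] = {0, 1, 2};  // keep lanes x, y, z
  return B.CreateShuffleVector(Wide, llvm::UndefValue::get(Vec4Ty), Mask,
                               "vec3.extract");
}

Reading the fourth lane is only legal because, as noted above, the spec gives vec3 the same size and alignment as vec4, so the extra lane is always there.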

Second, I'm not arguing that this is the ultimate fix, but it is valid, simple, and easy to remove once other areas of the compiler are improved to handle this case.

After discussing with others, the proper fix might be something like this:
1. Enhance target data to encode the "native" vector types for the target, along the same lines as the "native" integer types it already encodes.
2. Enhance the IR optimizer to force vectors to those types when it can get away with it.

This optimization needs to be early enough that other mid-level optimizations can take advantage of it.
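For comparison, target data already does exactly this for integers: the "n32:64" piece of a datalayout string lists the target's native integer widths, and DataLayout::isLegalInteger exposes them to the optimizers. Here is a minimal sketch of that existing facility; the vector-side query mentioned in the comment is only a hypothetical name, not an existing API:

#include "llvm/IR/DataLayout.h"
#include <cassert>

void nativeIntegerWidths() {
  // "n32:64" marks 32- and 64-bit integers as native for this target.
  llvm::DataLayout DL("e-p:64:64-n32:64");
  assert(DL.isLegalInteger(32) && DL.isLegalInteger(64));
  assert(!DL.isLegalInteger(24));
  // Step 1 above would add analogous vector information, e.g. a
  // hypothetical DL.isLegalVectorType(...), so step 2 could widen
  // <3 x float> to <4 x float> only on targets that actually want it.
}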

I'd still like to check this in, as no alternative solution exists right now. Because there is some debate, I think the code owner needs to have final say here (cc-ing Doug).

Thanks,
Tanya



> Thanks again,
> Hal
> 
>> From my perspective, there is no vec7 in
>> OpenCL, so there is no strong need for it to be generalized. It seems
>> to make sense that it would also improve code quality for vec7. So
>> while I could make the change, I don't have real world apps to test
>> it. I think this can easily be left as an extension for later.
>> 
>>> 
>>> Also, this optimization assumes that sizeof(element) is a power of
>>> 2, because the only thing that the AST guarantees is that the size
>>> of the vector type is a power of 2, and nextPow2(3 x E) ==
>>> nextPow2(4 x E) is only true in general if E is a power of 2.  Is
>>> that reasonable?  Is that checked somewhere?
>>> 
>> 
>> I don't think this is officially checked anywhere but is a general
>> rule.
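To spell out the arithmetic behind John's question, here is an illustrative check (not code from the patch):

#include <cstdint>

// Round up to the next power of two (illustrative helper).
constexpr uint64_t nextPow2(uint64_t N) {
  uint64_t P = 1;
  while (P < N)
    P <<= 1;
  return P;
}

// Power-of-two element size (e.g. 4-byte float): widening 3 elements to 4
// stays inside the power-of-two-sized slot the AST guarantees.
static_assert(nextPow2(3 * 4) == nextPow2(4 * 4), "3- and 4-element float");

// A hypothetical 5-byte element breaks the guarantee: 15 rounds to 16 but
// 20 rounds to 32, which is exactly the case being asked about.
static_assert(nextPow2(3 * 5) != nextPow2(4 * 5), "non-power-of-two element");

In practice the OpenCL vector element types all have power-of-two sizes, which is why this holds for vec3, but it is an implicit assumption rather than something the code checks.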
>> 
>> -Tanya
>> 
>> 
>>> Two minor notes.
>>> 
>>> +    const llvm::VectorType *VTy =
>>> +    dyn_cast<llvm::VectorType>(EltTy);
>>> 
>>> How can this fail?  If it can, that's interesting;  if it can't, it
>>> would be better if the "is this a vector operation" checks used the
>>> IR type instead of the AST type.
>>> 
>>> +      llvm::PointerType *DstPtr = cast<llvm::PointerType>(Addr->getType());
>>> +      if (DstPtr->getElementType() != SrcTy) {
>>> 
>>> This can't fail in the case we care about, so you can just merge
>>> this into the if-block above.
>>> 
>>> John.
>> 
>> _______________________________________________
>> cfe-commits mailing list
>> cfe-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/cfe-commits
> 
> 
> 
> -- 
> Hal Finkel
> Postdoctoral Appointee
> Leadership Computing Facility
> Argonne National Laboratory



