[cfe-commits] [PATCH] Optimize vec3 loads/stores
doug.gregor at gmail.com
Tue Jul 31 12:59:33 PDT 2012
On Fri, Jul 27, 2012 at 10:42 AM, Tanya Lattner <lattner at apple.com> wrote:
> On Jul 27, 2012, at 2:41 AM, Hal Finkel wrote:
>> On Mon, 23 Jul 2012 13:14:07 -0700
>> Tanya Lattner <lattner at apple.com> wrote:
>>> On Jul 18, 2012, at 6:51 PM, John McCall wrote:
>>>> On Jul 18, 2012, at 5:37 PM, Tanya Lattner wrote:
>>>>> On Jul 18, 2012, at 5:08 AM, Benyei, Guy wrote:
>>>>>> Hi Tanya,
>>>>>> Looks good and useful, but I'm not sure it should be Clang's
>>>>>> decision whether storing and loading vec4s is better than vec3.
>>>>> The idea was to have Clang generate code that the optimizers would
>>>>> be more likely to do something useful and smart with. I understand
>>>>> the concern, but I'm not sure where the best place for this would
>>>>> be then?
>>>> Hmm. The IR size of a <3 x blah> is basically the size of a <4 x
>>>> blah> anyway; arguably the backend already has all the information
>>>> it needs for this. Dan, what do you think?
>>>> One objection to doing this in the frontend is that it's not clear
>>>> to me that this is a transformation we should be doing if <4 x
>>>> blah> isn't actually legal for the target. But I'm amenable to the
>>>> idea that this belongs here.
>>> I do not think it's Clang's job to care about this, as we already
>>> have this problem for other vector sizes and it's target lowering's
>>> job to fix it.
>>>> I'm also a little uncomfortable with this patch because it's so
>>>> special-cased to 3. I understand that that might be all that
>>>> OpenCL really cares about, but it seems silly to add this code that
>>>> doesn't also kick in for, say, <7 x i16> or whatever. It really
>>>> shouldn't be difficult to generalize.
>>> While it could be generalized, I am only 100% confident in the
>>> codegen for vec3 as I know for sure that it improves the code quality
>>> that is ultimately generated. This is also thoroughly tested by our
>>> OpenCL compiler so I am confident we are not breaking anything and we
>>> are improving performance.
>> On the request of several people, I recently enhanced the BB vectorizer
>> to produce odd-sized vector types (when possible). This was for two
>> reasons:
>> 1. Some targets actually have instructions for length-3 vectors
>> (mostly for doing things on x,y,z triples), and they wanted
>> autovectorization support for these.
>> 2. This is generally a win elsewhere as well because the odd-length
>> vectors will be promoted to even-length vectors (which is good
>> compared to leaving scalar code).
>> In this context, I am curious to know how generating length-4 vectors
>> in the frontend gives better performance. Is this something that the BB
>> vectorizer (or any other vectorizer) should be doing as well?
> While I think you are doing great work with your vectorizer, it's not something we can rely on yet to get back this performance, since it's not on by default.
> A couple more comments about my patch. First, vec3 is an OpenCL type, and the spec says it should have the same size and alignment as vec4. So generating this code pattern for loads/stores makes sense in that context. I can enable this only if it's OpenCL, but I think anyone who uses vec3 would want this performance win.
> Secondly, I'm not arguing that this is the ultimate fix, but it is valid, simple, and easy to remove once other areas of the compiler are improved to handle this case.
> After discussing with others, the proper fix might be something like this:
> 1. Enhance target data to encode the "native" vector types for the target, along the same lines as the "native" integer types that it already can do.
> 2. Enhance the IR optimizer to force vectors to those types when it can get away with it.
> This optimization needs to be early enough that other mid-level optimizations can take advantage of it.
> I'd still like to check this in, as no alternative solution exists right now. Because there is some debate, I think the code owner needs to have final say here (cc-ing Doug).
I think it makes sense for this to go in. If our optimizers improve to
the point where they can handle vec3 directly as well as/better than
vec3-lowered-as-vec4, we can revisit this topic.
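
For anyone following the thread, the vec3-lowered-as-vec4 idea can be
sketched roughly in plain C. This is an illustration only, not code from
the patch: float3_t, float4_t, load_vec3_as_vec4 and store_vec3_as_vec4
are hypothetical names, the vector types are emulated with aligned
structs rather than real vector types, and the value written to the
padding lane is an arbitrary choice made for this sketch.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdalign.h> /* alignas, alignof */
#include <string.h>   /* memcpy */

/* Hypothetical stand-in for an OpenCL float3: three used lanes plus one
   padding lane, so it has the same size and alignment as a float4. */
typedef struct { alignas(16) float lane[4]; } float3_t;
typedef struct { alignas(16) float lane[4]; } float4_t;

static_assert(sizeof(float3_t) == sizeof(float4_t),
              "vec3 occupies the same storage as vec4");
static_assert(alignof(float3_t) == alignof(float4_t),
              "vec3 is aligned like vec4");

/* Load: read all four lanes in one full-width access, then keep only
   the first three (the shuffle step of the lowering). */
static float3_t load_vec3_as_vec4(const float3_t *p) {
    float4_t wide;
    memcpy(&wide, p, sizeof wide);  /* one full-width load */
    float3_t v = {{ wide.lane[0], wide.lane[1], wide.lane[2], 0.0f }};
    return v;
}

/* Store: widen to four lanes and write them in one full-width access.
   The fourth lane is padding; this sketch simply writes 0.0f there. */
static void store_vec3_as_vec4(float3_t *p, float x, float y, float z) {
    float4_t wide = {{ x, y, z, 0.0f }};
    memcpy(p, &wide, sizeof wide);  /* one full-width store */
}
```

Because the padding lane already belongs to the vec3 object, the wide
access never touches memory outside it, which is why the size and
alignment rule matters here.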