[cfe-commits] [PATCH] Optimize vec3 loads/stores

Doug Gregor doug.gregor at gmail.com
Tue Jul 31 13:34:24 PDT 2012


On Tue, Jul 31, 2012 at 1:24 PM, Hal Finkel <hfinkel at anl.gov> wrote:
> On Tue, 31 Jul 2012 12:59:33 -0700
> Doug Gregor <doug.gregor at gmail.com> wrote:
>
>> On Fri, Jul 27, 2012 at 10:42 AM, Tanya Lattner <lattner at apple.com>
>> wrote:
>> >
>> > On Jul 27, 2012, at 2:41 AM, Hal Finkel wrote:
>> >
>> >> On Mon, 23 Jul 2012 13:14:07 -0700
>> >> Tanya Lattner <lattner at apple.com> wrote:
>> >>
>> >>>
>> >>> On Jul 18, 2012, at 6:51 PM, John McCall wrote:
>> >>>
>> >>>> On Jul 18, 2012, at 5:37 PM, Tanya Lattner wrote:
>> >>>>> On Jul 18, 2012, at 5:08 AM, Benyei, Guy wrote:
>> >>>>>> Hi Tanya,
>> >>>>>> Looks good and useful, but I'm not sure it should be
>> >>>>>> Clang's decision whether storing and loading vec4s is better
>> >>>>>> than vec3.
>> >>>>>
>> >>>>> The idea was to have Clang generate code that the optimizers
>> >>>>> are more likely to do something useful and smart with. I
>> >>>>> understand the concern, but I'm not sure where the best place
>> >>>>> for this would be otherwise.
>> >>>>
>> >>>> Hmm.  The IR size of a <3 x blah> is basically the size of a <4 x
>> >>>> blah> anyway;  arguably the backend already has all the
>> >>>> information it needs for this.  Dan, what do you think?
>> >>>>
>> >>>> One objection to doing this in the frontend is that it's not
>> >>>> clear to me that this is a transformation we should be doing if
>> >>>> <4 x
>> >>>> blah> isn't actually legal for the target.  But I'm amenable to
>> >>>> the idea that this belongs here.
>> >>>>
>> >>>
>> >>> I do not think it's Clang's job to care about this; we already
>> >>> have this problem for other vector sizes, and it's target
>> >>> lowering's job to fix it.
>> >>>
>> >>>> I'm also a little uncomfortable with this patch because it's so
>> >>>> special-cased to 3.  I understand that that might be all that
>> >>>> OpenCL really cares about, but it seems silly to add this code
>> >>>> that doesn't also kick in for, say, <7 x i16> or whatever.  It
>> >>>> really shouldn't be difficult to generalize.
>> >>>
>> >>> While it could be generalized, I am only 100% confident in the
>> >>> codegen for vec3, as I know for sure that it improves the code
>> >>> quality that is ultimately generated. This is also thoroughly
>> >>> tested by our OpenCL compiler, so I am confident we are not
>> >>> breaking anything and we are improving performance.
>> >>
>> >> On the request of several people, I recently enhanced the BB
>> >> vectorizer to produce odd-sized vector types (when possible). This
>> >> was for two reasons:
>> >>
>> >> 1. Some targets actually have instructions for length-3 vectors
>> >> (mostly for doing things on x,y,z triples), and they wanted
>> >> autovectorization support for these.
>> >>
>> >> 2. This is generally a win elsewhere as well, because the
>> >> odd-length vectors will be promoted to even-length vectors (which
>> >> is good compared to leaving the code scalar).
>> >>
>> >> In this context, I am curious to know how generating length-4
>> >> vectors in the frontend gives better performance. Is this
>> >> something that the BB vectorizer (or any other vectorizer) should
>> >> be doing as well?
>> >>
>> >
>> > While I think you are doing great work with your vectorizer, it's
>> > not something we can rely on yet to get back this performance,
>> > since it's not on by default.
>> >
>> > A couple more comments about my patch. First, vec3 is an OpenCL
>> > type, and the spec defines it to have the same size and alignment
>> > as vec4. So generating this code pattern for loads/stores makes
>> > sense in that context. I could enable this only for OpenCL, but I
>> > think anyone who uses vec3 would want this performance win.
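For readers following along, the load side of the pattern under discussion can be sketched in IR roughly as follows. This is an illustrative reconstruction, not output from the patch itself: the function name and alignment are assumptions, and the store side is analogous (widen the value with a shuffle, then store all four lanes, which is legal precisely because the spec gives vec3 the full 16-byte size of vec4).

```llvm
; Hypothetical sketch of a float3 load widened to a float4 load.
; The pointer is reinterpreted as a <4 x float>*, one 16-byte load is
; issued, and a shuffle extracts the x/y/z lanes.
define <3 x float> @load_float3(<3 x float>* %p) {
  %cast = bitcast <3 x float>* %p to <4 x float>*
  %wide = load <4 x float>* %cast, align 16
  %v = shufflevector <4 x float> %wide, <4 x float> undef,
                     <3 x i32> <i32 0, i32 1, i32 2>
  ret <3 x float> %v
}
```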
>> >
>> > Secondly, I'm not arguing that this is the ultimate fix, but it is
>> > valid, simple, and easy to remove once other areas of the compiler
>> > are improved to handle this case.
>> >
>> > After discussing with others, the proper fix might be something
>> > like this:
>> > 1. Enhance target data to encode the "native" vector types for the
>> > target, along the same lines as the "native" integer types it can
>> > already encode.
>> > 2. Enhance the IR optimizer to force vectors to those types when it
>> > can get away with it.
>> >
>> > This optimization needs to be early enough that other mid-level
>> > optimizations can take advantage of it.
>> >
>> > I'd still like to check this in as there is not an alternative
>> > solution that exists right now. Because there is some debate, I
>> > think the code owner needs to have final say here (cc-ing Doug).
>>
>> I think it makes sense for this to go in. If our optimizers improve to
>> the point where they can handle vec3 directly as well as/better than
>> vec3-lowered-as-vec4, we can revisit this topic.
>
> Would it make sense to have a command-line switch to enable the old
> behavior so that the status of the backend optimizers can be easily
> checked?

At this point, doing that seems like extra work. If there's going to
be a concerted effort throughout the optimizers to make such vectors
work well, then it's worth adding that option as part of that
concerted effort.

  - Doug


