[Libclc-dev] [PATCH] Fix vload3/vstore3 to emit only one IR load

Fri Sep 25 14:49:26 PDT 2015

On 09/25/2015 02:32 PM, Jeroen Ketema wrote:
>> On 25 Sep 2015, at 22:20, Matt Arsenault <Matthew.Arsenault at amd.com> wrote:
>>
>> On 09/25/2015 02:13 PM, Jeroen Ketema wrote:
>>> Hi Matt,
>>>
>>> The IR below seem fishy to me: if we have
>>>
>>> vload3(get_global_id(0), A)
>>>
>>> then  the work item with the highest id is likely to access an element out of bounds of the array being passed in.
>>>
>>> Also, does the store generate a store of 4 elements, or will that be precisely be 3 elements?
>>>
>>> Jeroen
>> The store is also emitted as a <4 x i32>. I'm not sure why clang is avoiding direct load/store of 3 vectors, but this seems like a clang bug to me.
> Right, then something like:
>
> __kernel foo(__global int *A) {
>    int3 tmp = vload3(get_global_id(0), A);
>    tmp += 1;
>    vstore3(tmp, get_global_id(0), A);
> }
>
> will have a data race, e.g., on the 4th element of A. And reading the spec, I think it should be data race free "vload3 and vload_half3 read x, y, z components from address (p + (offset * 3))”.
>
> The reason why the load and store are turned in to 4 element loads and stores with your new code, is because something like:
>
> int3 B[n];
>
> will have an element of padding after each 3rd element. Hence, you’re just reading or writing an additional undefined value, which is not overlapping with one of the next element of the array (so no data race), and the reasoning is probably that a 4 element load or store is likely to be more efficient than a 3-element one. See also Section 6.1.5
> "Alignment of Types" in the OpenCL 1.2 spec.
>
> Jeroen
The code that does this was added in r162002: "Convert loads and stores 
of vec3 to vec4 to achieve better code generation. Add test case."

I don't think this is really something the frontend should be doing, but 
I guess if you can assume a 3 vector type pointer always points to the 
rounded up size, I guess it is correct.

I don't think there's any other solution than to write the 
vload3/vstore3 directly in IR then without adding some other kind of 
clang extension.