[Libclc-dev] [PATCH] Fix vload3/vstore3 to emit only one IR load

Wed Sep 30 03:42:49 PDT 2015

Hi Matt,

I generally think it makes sense to perform some optimisations in a compiler front-end (because it has certain information that later phases might not have). However, I appreciate that there are different schools of thought here.

Regarding the problem at hand: indeed the best approach might be to encode vload3/store3 in IR (unfortunately).

Best,

 Jeroen

> On 25 Sep 2015, at 22:49, Matt Arsenault <Matthew.Arsenault at amd.com> wrote:
> 
> On 09/25/2015 02:32 PM, Jeroen Ketema wrote:
>>> On 25 Sep 2015, at 22:20, Matt Arsenault <Matthew.Arsenault at amd.com> wrote:
>>> 
>>> On 09/25/2015 02:13 PM, Jeroen Ketema wrote:
>>>> Hi Matt,
>>>> 
>>>> The IR below seem fishy to me: if we have
>>>> 
>>>> vload3(get_global_id(0), A)
>>>> 
>>>> then  the work item with the highest id is likely to access an element out of bounds of the array being passed in.
>>>> 
>>>> Also, does the store generate a store of 4 elements, or will that be precisely be 3 elements?
>>>> 
>>>> Jeroen
>>> The store is also emitted as a <4 x i32>. I'm not sure why clang is avoiding direct load/store of 3 vectors, but this seems like a clang bug to me.
>> Right, then something like:
>> 
>> __kernel foo(__global int *A) {
>>   int3 tmp = vload3(get_global_id(0), A);
>>   tmp += 1;
>>   vstore3(tmp, get_global_id(0), A);
>> }
>> 
>> will have a data race, e.g., on the 4th element of A. And reading the spec, I think it should be data race free "vload3 and vload_half3 read x, y, z components from address (p + (offset * 3))”.
>> 
>> The reason why the load and store are turned in to 4 element loads and stores with your new code, is because something like:
>> 
>> int3 B[n];
>> 
>> will have an element of padding after each 3rd element. Hence, you’re just reading or writing an additional undefined value, which is not overlapping with one of the next element of the array (so no data race), and the reasoning is probably that a 4 element load or store is likely to be more efficient than a 3-element one. See also Section 6.1.5
>> "Alignment of Types" in the OpenCL 1.2 spec.
>> 
>> Jeroen
> The code that does this was added in r162002: "Convert loads and stores of vec3 to vec4 to achieve better code generation. Add test case."
> 
> I don't think this is really something the frontend should be doing, but I guess if you can assume a 3 vector type pointer always points to the rounded up size, I guess it is correct.
> 
> I don't think there's any other solution than to write the vload3/vstore3 directly in IR then without adding some other kind of clang extension.