[llvm-dev] Bitwise AND blocks re-association
Timofeev, Alexander via llvm-dev
llvm-dev at lists.llvm.org
Wed Nov 16 07:28:49 PST 2016
Hi All,
I'm writing to ask for the suggestion.
I'm working on the AMDGPU project. I was recently analyzing our backend performance looking for optimization opportunities and found that LoadStore vectorizer is unable to handle simple case.
The deeper look revealed that the reason was in a i64 to i32 truncation.
The OpenCL code looks like this:
__kernel void read_linear(__global float *input,__global float *output)
{
float val = 0.0f;
uint gid = get_global_id(0);
val = val + input[gid + 0];
val = val + input[gid + 1];
val = val + input[gid + 2];
val = val + input[gid + 3];
val = val + input[gid + 4];
*** and so on... ***
output[gid] = val;
}
I'd expect loads to be combined since they all have same base and offsets are constants and consecutively increasing.
That is exactly what happens if I change 'uint' to 'ulong' in the assignment: uint gid = get_global_id(0) => ulong gid = get_global_id(0) - loads are combined as expected.
The reason is simple:
OpenCL get_global_id(uint) returns i64
The result is assigned to i32
Clang generates trunc i64 to i32 as AND with FFFFFFFF:
%call = tail call i64 @_Z13get_global_idj(i32 0) #2
%idxprom = and i64 %call, 4294967295
%arrayidx = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom
%0 = load float, float addrspace(1)* %arrayidx, align 4, !tbaa !7
So, LoadStoreVectorizer cannot combine with the next load:
%add2 = add i64 %call, 1
%idxprom3 = and i64 %add2, 4294967295
%arrayidx4 = getelementptr inbounds float, float addrspace(1)* %input, i64 %idxprom3
%1 = load float, float addrspace(1)* %arrayidx4, align 4, !tbaa !7
Basically we have a case: C1 & ( A + C2 ) => C1 & A + C1 & C2 that is not always legal.
In our case that really is trunc(A+C) => trunc(A) + trunc(C) that is legal.
NaryReassociate cannot handle this because it only considers ADD, MUL, GEP
Reassociate cannot handle this either just because it explicitly looks for "X&~X == 0, X|~X == -1." or "Y ^ X^X -> Y"
We definitely should be able to optimize 0xFFFFFFFF & ( A + 1 ) to 0xFFFFFFFF & A + 1 given that bitwise AND with FFFFFFFF is a special case.
We maybe want be able to enhance bitwise operation processing to handle more common cases.
I'd like to know your opinion on this topic.
Thanks in advance.
Alexander.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20161116/c214204c/attachment-0001.html>
More information about the llvm-dev
mailing list