[PATCH] D89697: * [x86] Implement smarter instruction lowering for FP_TO_UINT for vXf32 to vXi32 from SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.
Tom Hender via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon Oct 19 06:36:54 PDT 2020
TomHender created this revision.
TomHender added reviewers: RKSimon, craig.topper.
Herald added subscribers: llvm-commits, pengfei, dexonsmith, hiraditya.
Herald added a project: LLVM.
TomHender requested review of this revision.
We know that "CVTTPS2SI" returns 0x80000000 for out-of-range inputs. We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend instructions, using the following logic:
small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
Even on targets where "PBLENDVPS"/"PBLENDVB" exists, it is often a latency-2, low-throughput instruction, so this logic is applied there too (in particular for AVX2). It also gets rid of one high-latency floating-point comparison that the previous lowering needed.
I checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).
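As an illustration only (this is not code from the patch), the logic above can be modeled one lane at a time in scalar C++. The helper names cvttps2si_lane and fp_to_ui_lane are hypothetical, and the first one merely emulates the documented out-of-range behavior of the instruction:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical scalar model of a single CVTTPS2SI lane: truncate toward zero,
// and return 0x80000000 (INT32_MIN) for NaN or out-of-range inputs.
static int32_t cvttps2si_lane(float x) {
    if (std::isnan(x) || x >= 2147483648.0f || x < -2147483648.0f)
        return INT32_MIN;
    return static_cast<int32_t>(x);
}

// Hypothetical scalar model of the proposed unsigned lowering:
//   small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
static uint32_t fp_to_ui_lane(float x) {
    int32_t small = cvttps2si_lane(x);
    int32_t big = cvttps2si_lane(x - 2147483648.0f);
    // Assumes >> on a negative int32_t is an arithmetic shift (guaranteed
    // since C++20, and what mainstream compilers do before that).
    // The mask is all ones exactly when small has its sign bit set,
    // i.e. when the first conversion overflowed for x >= 2^31.
    int32_t mask = small >> 31;
    return static_cast<uint32_t>(small) |
           (static_cast<uint32_t>(big) & static_cast<uint32_t>(mask));
}
```

For x in [0, 2^31) the mask is zero and the result is just small. For x in [2^31, 2^32) small is 0x80000000, the mask is all ones, and the second conversion supplies the low 31 bits. For example, fp_to_ui_lane(3000000000.0f) yields 3000000000 and fp_to_ui_lane(-0.5f) yields 0.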
I have adjusted some cost model values for this, but I am not sure I have done that right. The existing costs don't look very consistent to me. For example, a conversion from 8 floats to 8 uint8/int8 gives a cost of 7 although fewer instructions are generated <https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:3,positionColumn:1,positionLineNumber:3,selectionStartColumn:1,selectionStartLineNumber:3,startColumn:1,startLineNumber:3),source:'using+v+%3D+__attribute__((vector_size(32)))+float%3B%0Ausing+vb+%3D+__attribute__((vector_size(8)))+signed+char%3B%0Avb+get(v+a)+%7B+return+__builtin_convertvector(a,+vb)%3B+%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:55.150326797385624,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:clang_trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:1,lang:c%2B%2B,libs:!(),options:'-O3+-g0',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:44.84967320261438,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:ir,i:(editorid:1,fontScale:14,j:1,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1)),l:'5',n:'0',o:'x86-64+clang+(trunk)+IR+Viewer+(Editor+%231,+Compiler+%231)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:44.84967320261438,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4> and, latency-wise, the dependency chain is much shorter.
v4i32 to v4f64 is even more extreme, with a cost of 16 although the generated code is nice and simple <https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:1,lang:c%2B%2B,selection:(endColumn:55,endLineNumber:2,positionColumn:55,positionLineNumber:2,selectionStartColumn:55,selectionStartLineNumber:2,startColumn:55,startLineNumber:2),source:'using+v+%3D+__attribute__((vector_size(32)))+double%3B%0Ausing+vb+%3D+__attribute__((vector_size(16)))+signed+int%3B%0Avb+get(v+a)+%7B+return+__builtin_convertvector(a,+vb)%3B+%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:55.150326797385624,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:clang_trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:1,lang:c%2B%2B,libs:!(),options:'-O3+-g0',selection:(endColumn:20,endLineNumber:10,positionColumn:20,positionLineNumber:10,selectionStartColumn:20,selectionStartLineNumber:10,startColumn:20,startLineNumber:10),source:1),l:'5',n:'0',o:'x86-64+clang+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:44.84967320261438,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:ir,i:(editorid:1,fontScale:14,j:1,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1)),l:'5',n:'0',o:'x86-64+clang+(trunk)+IR+Viewer+(Editor+%231,+Compiler+%231)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:44.84967320261438,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4>. I have set the new cost for the conversion in this patch to 8, based on what was set previously and on the latency of the longest dependency chain (the explanation at the top of the file says the cost should reflect that). Additionally, type legalization isn't done before the cost tables are looked up, so I had to add multiple entries for different vector widths, which seems redundant.
Repository:
rG LLVM Github Monorepo
https://reviews.llvm.org/D89697
Files:
llvm/lib/Target/X86/X86ISelLowering.cpp
llvm/lib/Target/X86/X86TargetTransformInfo.cpp
llvm/test/Analysis/CostModel/X86/fptoui.ll
llvm/test/CodeGen/X86/concat-cast.ll
llvm/test/CodeGen/X86/ftrunc.ll
llvm/test/CodeGen/X86/vec_cast3.ll
llvm/test/CodeGen/X86/vec_fp_to_int.ll
llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll
llvm/test/Transforms/SLPVectorizer/X86/fptoui.ll
-------------- next part --------------
A non-text attachment was scrubbed...
Name: D89697.299025.patch
Type: text/x-patch
Size: 44021 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20201019/c9d899ce/attachment.bin>