[PATCH] D89697: * [x86] Implement smarter instruction lowering for FP_TO_UINT for vXf32 to vXi32 from SSE2 and AVX2 by using the exact semantic of the CVTTPS2SI instruction.

Mon Oct 19 06:36:54 PDT 2020

TomHender created this revision.
TomHender added reviewers: RKSimon, craig.topper.
Herald added subscribers: llvm-commits, pengfei, dexonsmith, hiraditya.
Herald added a project: LLVM.
TomHender requested review of this revision.

We know that "CVTTPS2SI" returns 0x80000000 for out-of-range inputs. We can use this to make unsigned conversions from vXf32 to vXi32 more efficient, particularly on targets without blend instructions, using the following logic:
small := CVTTPS2SI(x);
fp_to_ui(x) := small | (CVTTPS2SI(x - 2^31) & ARITHMETIC_RIGHT_SHIFT(small, 31))
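
A scalar sketch of the logic above may make the bit trick easier to follow. It models a single lane in portable C++ rather than using the real instruction. The helper names `cvttps2si_lane` and `fp_to_ui_lane` are made up for illustration and are not taken from the patch, and the input range asserted here is the one the description claims to cover.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Scalar model of one CVTTPS2SI lane: truncate toward zero, and return
// 0x80000000 for NaN or for values that do not fit in a signed 32-bit int.
std::int32_t cvttps2si_lane(float x) {
    if (std::isnan(x) || x >= 2147483648.0f || x < -2147483648.0f)
        return INT32_MIN;
    return static_cast<std::int32_t>(x);  // in range, so the cast is well defined
}

// One lane of the lowering described above (hypothetical helper name).
std::uint32_t fp_to_ui_lane(float x) {
    assert(x > -1.0f && x < 4294967296.0f);  // range covered by the description

    const std::int32_t small = cvttps2si_lane(x);
    const std::int32_t big = cvttps2si_lane(x - 2147483648.0f);  // x - 2^31

    // Models ARITHMETIC_RIGHT_SHIFT(small, 31): all ones when the sign bit of
    // small is set (x >= 2^31 made the first conversion overflow), else zero.
    const std::uint32_t mask = small < 0 ? 0xFFFFFFFFu : 0u;

    return static_cast<std::uint32_t>(small) |
           (static_cast<std::uint32_t>(big) & mask);
}
```

For inputs below 2^31 the mask is zero and the result is just `small`. For larger inputs `small` is 0x80000000, which supplies the top bit, and the second conversion supplies the low 31 bits.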

Even on targets where "BLENDVPS"/"PBLENDVB" exists, it is often a latency-2, low-throughput instruction, so this logic is applied there too (in particular for AVX2). It also gets rid of one high-latency floating point comparison in the previous lowering.

I checked the correctness of this for all possible floats between -1 and 2^32 (both ends excluded).
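
One way such a check could be written is a sweep over float bit patterns that compares a scalar model of the lowering against a plain cast. This is only a sketch and not necessarily the program used for this patch. The names `cvt_model`, `lowered` and `count_mismatches` are hypothetical, and the `stride` parameter is an addition that allows a quick sampled run (a stride of 1 gives the exhaustive sweep).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar model of one CVTTPS2SI lane (0x80000000 for NaN or out of range).
std::int32_t cvt_model(float x) {
    if (std::isnan(x) || x >= 2147483648.0f || x < -2147483648.0f)
        return INT32_MIN;
    return static_cast<std::int32_t>(x);
}

// Scalar model of the proposed lowering for one lane.
std::uint32_t lowered(float x) {
    const std::int32_t small = cvt_model(x);
    const std::int32_t big = cvt_model(x - 2147483648.0f);
    const std::uint32_t mask = small < 0 ? 0xFFFFFFFFu : 0u;  // small >> 31
    return static_cast<std::uint32_t>(small) |
           (static_cast<std::uint32_t>(big) & mask);
}

// Counts floats whose lowered result differs from a plain cast. The floats
// are given by their bit patterns first_bits..last_bits, visited every
// `stride` patterns. Callers must keep the range inside (-1, 2^32), where
// the plain cast is well defined.
std::uint64_t count_mismatches(std::uint32_t first_bits,
                               std::uint32_t last_bits,
                               std::uint32_t stride) {
    assert(stride != 0 && first_bits <= last_bits);
    std::uint64_t bad = 0;
    // 64-bit counter so that the increment cannot wrap around.
    for (std::uint64_t b = first_bits; b <= last_bits; b += stride) {
        const std::uint32_t bits = static_cast<std::uint32_t>(b);
        float x;
        std::memcpy(&x, &bits, sizeof x);  // reinterpret the bits as a float
        if (lowered(x) != static_cast<std::uint32_t>(x))
            ++bad;
    }
    return bad;
}
```

With these definitions, the non-negative floats below 2^32 are the bit patterns 0x00000000 to 0x4F7FFFFF, and the floats from -0.0 down to just above -1 are 0x80000000 to 0xBF7FFFFF, so two calls cover the range stated above.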

I have adjusted some cost model values for this, but I am not sure I have done that right. The given costs don't look very consistent to me. For example, a conversion from 8 floats to 8 uint8/int8 gives me a cost of 7 although fewer instructions are generated <https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:1,lang:c%2B%2B,selection:(endColumn:1,endLineNumber:3,positionColumn:1,positionLineNumber:3,selectionStartColumn:1,selectionStartLineNumber:3,startColumn:1,startLineNumber:3),source:'using+v+%3D+__attribute__((vector_size(32)))+float%3B%0Ausing+vb+%3D+__attribute__((vector_size(8)))+signed+char%3B%0Avb+get(v+a)+%7B+return+__builtin_convertvector(a,+vb)%3B+%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:55.150326797385624,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:clang_trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:1,lang:c%2B%2B,libs:!(),options:'-O3+-g0',selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1),source:1),l:'5',n:'0',o:'x86-64+clang+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:44.84967320261438,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:ir,i:(editorid:1,fontScale:14,j:1,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1)),l:'5',n:'0',o:'x86-64+clang+(trunk)+IR+Viewer+(Editor+%231,+Compiler+%231)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:44.84967320261438,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4> and, latency-wise, the dependency chain is much shorter.
v4f64 to v4i32 is even more extreme, with a cost of 16 although the generated code is nice and simple <https://gcc.godbolt.org/#g:!((g:!((g:!((h:codeEditor,i:(fontScale:14,j:1,lang:c%2B%2B,selection:(endColumn:55,endLineNumber:2,positionColumn:55,positionLineNumber:2,selectionStartColumn:55,selectionStartLineNumber:2,startColumn:55,startLineNumber:2),source:'using+v+%3D+__attribute__((vector_size(32)))+double%3B%0Ausing+vb+%3D+__attribute__((vector_size(16)))+signed+int%3B%0Avb+get(v+a)+%7B+return+__builtin_convertvector(a,+vb)%3B+%7D'),l:'5',n:'0',o:'C%2B%2B+source+%231',t:'0')),k:55.150326797385624,l:'4',n:'0',o:'',s:0,t:'0'),(g:!((g:!((h:compiler,i:(compiler:clang_trunk,filters:(b:'0',binary:'1',commentOnly:'0',demangle:'0',directives:'0',execute:'1',intel:'0',libraryCode:'1',trim:'1'),fontScale:14,j:1,lang:c%2B%2B,libs:!(),options:'-O3+-g0',selection:(endColumn:20,endLineNumber:10,positionColumn:20,positionLineNumber:10,selectionStartColumn:20,selectionStartLineNumber:10,startColumn:20,startLineNumber:10),source:1),l:'5',n:'0',o:'x86-64+clang+(trunk)+(Editor+%231,+Compiler+%231)+C%2B%2B',t:'0')),k:44.84967320261438,l:'4',m:50,n:'0',o:'',s:0,t:'0'),(g:!((h:ir,i:(editorid:1,fontScale:14,j:1,selection:(endColumn:1,endLineNumber:1,positionColumn:1,positionLineNumber:1,selectionStartColumn:1,selectionStartLineNumber:1,startColumn:1,startLineNumber:1)),l:'5',n:'0',o:'x86-64+clang+(trunk)+IR+Viewer+(Editor+%231,+Compiler+%231)',t:'0')),header:(),l:'4',m:50,n:'0',o:'',s:0,t:'0')),k:44.84967320261438,l:'3',n:'0',o:'',t:'0')),l:'2',n:'0',o:'',t:'0')),version:4>. I have set the new cost for the conversion in this patch to 8, based on what was set previously and on the latency of the longest dependency chain (the explanation at the top of the file says it should be that). Additionally, type legalization isn't done before looking up the cost tables, so I had to add multiple entries for different vector widths, which seems redundant.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D89697

Files:
  llvm/lib/Target/X86/X86ISelLowering.cpp
  llvm/lib/Target/X86/X86TargetTransformInfo.cpp
  llvm/test/Analysis/CostModel/X86/fptoui.ll
  llvm/test/CodeGen/X86/concat-cast.ll
  llvm/test/CodeGen/X86/ftrunc.ll
  llvm/test/CodeGen/X86/vec_cast3.ll
  llvm/test/CodeGen/X86/vec_fp_to_int.ll
  llvm/test/Transforms/SLPVectorizer/X86/alternate-cast.ll
  llvm/test/Transforms/SLPVectorizer/X86/fptoui.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D89697.299025.patch
Type: text/x-patch
Size: 44021 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20201019/c9d899ce/attachment.bin>