<div dir="ltr"><div dir="ltr"><div dir="ltr"><div>Hello all,</div><div><br></div><div>This is a proposal for reducing the number of ptrtoint/inttoptr casts that are not</div><div>written by programmers but are instead generated by LLVM passes.</div><div>Currently the majority of ptrtoint/inttoptr casts are generated by LLVM;</div><div>when compiling SPEC 2017 with LLVM r348082 (Dec 2 2018) with -O3,</div><div>the output IR contains 22,771 inttoptr instructions. However, when</div><div>compiling it with -O0, there are only 1,048 inttoptrs, meaning that 95.4%</div><div>of them are generated by LLVM passes.</div><div><br></div><div>The trend is similar for ptrtoint instructions. When compiling SPEC 2017</div><div>with -O0, there are 23,208 ptrtoint instructions, and 22,016 of them (94.8%)</div><div>are generated by the Clang frontend to represent pointer subtraction.</div><div>They are not effectively optimized away: there are even more ptrtoints (31,721) after -O3.</div><div>This hurts performance because the presence of a ptrtoint makes analyses return conservative</div><div>results, since a pointer can escape through the cast.</div><div>A memory access through a pointer that came from an inttoptr is assumed</div><div>to possibly access any location, so it may block</div><div>store-to-load forwarding, merging of two identical loads, etc.</div><div><br></div><div>I believe this can be addressed by applying two patches: the first represents pointer subtraction with a dedicated intrinsic function, llvm.psub, and the second disables the following InstCombine transformation:</div><div><br></div><div>  %q = load i8*, i8** %p1</div><div>  store i8* %q, i8** %p2</div><div>=></div><div>  %1 = bitcast i8** %p1 to i64*</div><div>  %q1 = load i64, i64* %1, align 8</div><div>  %2 = bitcast i8** %p2 to i64*</div><div>  store i64 %q1, i64* %2, align 8</div><div><br></div><div>This transformation can introduce inttoptrs later if loads are followed (<a 
href="https://godbolt.org/z/wsZ3II">https://godbolt.org/z/wsZ3II</a> ). Both are discussed in <a href="https://bugs.llvm.org/show_bug.cgi?id=39846">https://bugs.llvm.org/show_bug.cgi?id=39846</a> as well.</div><div>With llvm.psub in use and this transformation disabled, the number of inttoptrs decreases from 22,771 to 1,565 (6.9% of the original count), and the number of ptrtoints decreases from 31,721 to 7,772 (24.5%).</div><div><br></div><div>I'll introduce the llvm.psub patch first.</div><div><br></div><div><br></div><div>--- Adding llvm.psub ---</div><div><br></div><div>Defining a pointer subtraction intrinsic can yield a performance gain because it has more undefined behavior than simply subtracting two ptrtoints.</div><div><br></div><div>Patch <a href="https://reviews.llvm.org/D56598">https://reviews.llvm.org/D56598</a> adds the llvm.psub(p1, p2) intrinsic function, which subtracts two pointers and returns the difference. Its semantics are as follows.</div><div>If p1 and p2 point to different objects, and neither of them is based on a pointer cast from an integer, `llvm.psub(p1, p2)` returns poison. For example,</div><div><br></div><div>%p = alloca</div><div>%q = alloca</div><div>%i = llvm.psub(%p, %q) ; %i is poison</div><div><br></div><div>This allows aggressive escape analysis on pointers. Given i = llvm.psub(p1, p2), if neither p1 nor p2 is based on a pointer cast from an integer, the llvm.psub call does not make p1 or p2 escape. 
(<a href="https://reviews.llvm.org/D56601">https://reviews.llvm.org/D56601</a> )</div><div><br></div><div>If either p1 or p2 is based on a pointer cast from an integer, or p1 and p2 point to the same object, it returns the result of the subtraction (in bytes); for example,</div><div><br></div><div>%p = alloca</div><div>%q = inttoptr %x</div><div>%i = llvm.psub(%p, %q) ; %i is equivalent to (ptrtoint %p) - %x</div><div><br></div><div>`null` is regarded as a pointer cast from an integer because</div><div>it is equivalent to `inttoptr 0`.</div><div><br></div><div>Adding llvm.psub allows LLVM to remove a significant portion of ptrtoints and a portion of inttoptrs. With llvm.psub in use, when SPECrate 2017 is compiled with -O3, the number of inttoptrs decreases to ~13,500 (59% of the original count) and the number of ptrtoints decreases to ~14,300 (45%).</div><div><br></div><div>To see the performance change, I ran SPECrate 2017 (thread count = 1) with three versions of LLVM: r313797 (Sep 21, 2017), LLVM 6.0 official, and r348082 (Dec 2, 2018).</div><div>With r313797, 505.mcf_r shows a consistent 2.0% speedup across 3 different machines (i3-6100, i5-6600, i7-7700). For LLVM 6.0 and r348082, there is neither a consistent speedup nor a consistent slowdown, and the average speedup is near 0. I believe there is still room for improvement because some passes are not yet aware of llvm.psub.</div><div><br></div><div>Thank you for reading this, and any comments are welcome.</div><div><br></div><div>Best Regards,</div><div>Juneyoung Lee</div></div></div></div>