[llvm-dev] Reducing the number of ptrtoint/inttoptrs that are generated by LLVM
Juneyoung Lee via llvm-dev
llvm-dev at lists.llvm.org
Mon Jan 14 03:23:23 PST 2019
This is a proposal for reducing the number of ptrtoint/inttoptr casts that are not written by programmers but rather generated by LLVM passes.
Currently the majority of ptrtoint/inttoptr casts are generated by LLVM: when compiling SPEC 2017 with LLVM r348082 (Dec 2 2018) at -O3, the output IR contains 22,771 inttoptr instructions, but when compiling it at -O0 there are only 1,048 inttoptrs, meaning that 95.4% of them are generated by LLVM passes.
The trend is similar for ptrtoint instructions. When compiling SPEC at -O0, there are 23,208 ptrtoint instructions, and among them 22,016 are generated by the Clang frontend to represent pointer subtraction. They are not effectively optimized out either: after -O3 there are even more ptrtoints (31,721).
This is bad for performance: the presence of a ptrtoint forces analyses to conservatively conclude that the pointer may escape through the cast, and a memory access through a pointer that comes from an inttoptr is assumed to potentially access anywhere. This may block store-to-load forwarding, merging of two identical loads, etc.
I believe this can be addressed by applying two patches: the first represents pointer subtraction with a dedicated intrinsic function, llvm.psub, and the second disables the InstCombine transformation that rewrites a pointer-typed load/store pair

  %q = load i8*, i8** %p1
  store i8* %q, i8** %p2

into an integer-typed one:

  %1 = bitcast i8** %p1 to i64*
  %q1 = load i64, i64* %1, align 8
  %2 = bitcast i8** %p2 to i64*
  store i64 %q1, i64* %2, align 8
This transformation can introduce inttoptrs later if the stored integer is subsequently loaded and used as a pointer (https://godbolt.org/z/wsZ3II). Both changes are discussed in https://bugs.llvm.org/show_bug.cgi?id=39846 as well.
After llvm.psub is used and this transformation is disabled, the number of inttoptrs decreases from 22,771 to 1,565 (6.9% of the original), and the number of ptrtoints decreases from 31,721 to 7,772 (24.5%).
I'll introduce the llvm.psub patch first.
--- Adding llvm.psub ---
By defining a pointer subtraction intrinsic, we can get a performance gain because its semantics admit more undefined behavior than simply subtracting two ptrtoint results.
Patch https://reviews.llvm.org/D56598 adds the llvm.psub(p1, p2) intrinsic function, which subtracts two pointers and returns the difference. Its semantics are as follows.
If p1 and p2 point to different objects, and neither of them is based on a pointer cast from an integer, `llvm.psub(p1, p2)` returns poison. For example:

  %p = alloca i8
  %q = alloca i8
  %i = llvm.psub(%p, %q) ; %i is poison
This allows aggressive escape analysis on pointers: given i = llvm.psub(p1, p2), if neither p1 nor p2 is based on a pointer cast from an integer, the llvm.psub call does not make p1 or p2 escape.
If either p1 or p2 is based on a pointer cast from an integer, or p1 and p2 point to the same object, it returns the result of the subtraction (in bytes). For example:

  %p = alloca i8
  %q = inttoptr %x
  %i = llvm.psub(%p, %q) ; %i is equivalent to (ptrtoint %p) - %x
`null` is regarded as a pointer cast from an integer because it is equivalent to `inttoptr 0`.
Adding llvm.psub allows LLVM to optimize away a significant portion of ptrtoints and a portion of inttoptrs. After llvm.psub is used, when SPECrate 2017 is compiled at -O3, the number of inttoptrs decreases to ~13,500 (59% of the original) and the number of ptrtoints decreases to ~14,300 (45%).
To see the performance change, I ran SPECrate 2017 (with thread count = 1) built with three versions of LLVM: r313797 (Sep 21, 2017), the official LLVM 6.0 release, and r348082 (Dec 2, 2018).
With r313797, 505.mcf_r shows a consistent 2.0% speedup across three different machines (i3-6100, i5-6600, i7-7700). For LLVM 6.0 and r348082 there is neither a consistent speedup nor a slowdown, and the average change is near 0. I believe there is still room for improvement because some passes are not yet aware of llvm.psub.
Thank you for reading this; any comments are welcome.