[LLVMdev] RFC: Indexing of structs vs arrays in getelementpointer
Louis Gerbarg
lgg at apple.com
Thu May 22 15:42:36 PDT 2014
Recently I posted a patch to migrate certain GEPs between basic blocks in cases where doing so would improve the ability of instcombine to merge into more complicated addressing mode (r209049 and r209065). After some build to failures it was rolled back. I now have a patch that no longer causes the regressions I was seeing, but it also no longer can optimize the case I was trying to optimize. As several other people have shown me optimization cases where the patch does work for them I am going to commit it again once I write some new test cases people who have access to the relevant (failing) configs can verify that it works for them, but I wanted to discuss the the case that I was trying to optimize, and how transforming the types of some some values would allow for the optimization to occur.
It is worth noting up front that I could certainly do this optimization in a peephole, but I think by looking at some broader changes it might be possible to tease out other optimization opportunities.
The problem:
On x86 when you compile:
unsigned testu(llvm::DenseMap<unsigned, unsigned> &dm, unsigned key) {
return dm.lookup(key);
}
is compiled down it contains:
leaq (%r8,%rdi,8), %rax
movl 4(%rax), %eax
when what you really want is:
movl 4(%r8,%rdi,8), %eax
Exactly how bad that is depends on the exact micro architecture as some machines have different cache port configs and AGU capabilities, but it isn’t good. The reason why we get into this situation is that The IR that results GEP->PHI->GEP->LOAD chain. Currently the first GEP and the GEP->LOAD are being matched separately into the leaq and movl respectively. I wrote some code to move GEPs across PHIs so that the whole GEP->GEP->LOAD could be matched, and it first glance it worked and generated the optimized movl:
%struct2 = type { i32, i32 }
bb1:
%tmp10 = getelementptr inbounds %struct2* %tmp1, i64 %tmp9
br label %bb3
bb2:
%tmp20 = getelementptr inbounds %struct2* %tmp1, i64 %tmp19
br label %bb3
bb3:
%phi = phi %struct2* [ %tmp10, %bb1 ], [ %tmp20, %bb2 ]
%tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
is rewritten as:
%struct2 = type { i32, i32 }
bb1:
br label %bb3
bb2:
br label %bb3
bb3:
%phi = phi %struct2* [ %tmp9, %bb1 ], [ %tmp19, %bb2 ]
%tno21 = getelementptr inbounds %struct2* %tmp1, i64 % phi
%tmp24 = getelementptr inbounds %struct2* %phi, i64 0, i32 1
%tmp25 = load i32* %tmp24, align 4
Which eventually gets InstCombined into a single getelementptr and matched to the single movl.
The problem that the above transform is technically illegal because “When indexing into a (optionally packed) structure, only i32 integer constants are allowed (when using a vector of indices they must all be the same i32 integer constant).” rule <http://llvm.org/docs/LangRef.html#getelementptr-instruction>. In practice it worked in this case, I can only presume that despite the language ref claiming that GEPs into structs require constant indexes that in practice variable indexes work so long as the elements all have a common size/alignment. Having said that it blew up in other cases, and when I added a check to prevent merging indexes that were structs it caused the above case to no longer be optimizable.
What to do about it:
Well, there are a couple of pragmatic solutions that I could apply. I could handle it as a peephole on x86, or do some bit casts in the IR. Before I go down that path I wanted to ask a more interesting question: Are we missing other optimization opportunities because we are using structs for homogenous data that could be expressed in terms of arrays? If we were to write an early pass that effectively rewrote structs into arrays would we see any improvements (besides the one I mentioned in this email). There are a couple of different ways of doing it:
{i32, i32} has the same layout as [2 x i32] or {[2 x i32]}. My gut feeling is that [2 x i32] is the most optimizable, but that {[2 x i32]} probably is almost and good and potentially eliminates certain edge cases. How about heterogenous structs with homogenous runs: Does rewriting {i32, i32, i16} as {[2 x i32], i16} allow for better optimizations? Or maybe it makes sense to loosen the language restrictions on GEP indexes?
Louis
More information about the llvm-dev
mailing list