<html><head></head><body dir="auto" style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><blockquote type="cite">On Oct 22, 2015, at 9:01 AM, Sanjay Patel <spatel@rotateright.com> wrote:<br><br>spatel added a comment.<br><br>In http://reviews.llvm.org/D13417#272787, @Gerolf wrote:<br><br><blockquote type="cite">That is still the same test case and the outcome should be independent of compiler revisions.<br></blockquote><br><br>OK - at least we are certain that we are analyzing the same example now. :)<br></blockquote>I never had doubts :-)<br><blockquote type="cite"><br><blockquote type="cite">W/o reassoc the fill after the last call is not removed, but you get more ILP out of the 3 vadds. W reassoc there is no fill, but less ILP. You can’t get both higher ILP *and* a correct program, since in your proposed faster sequence xmm1 would not be initialized:<br></blockquote><br><blockquote type="cite"><br></blockquote><br><blockquote type="cite"> vaddsd 16(%rsp), %xmm1, %xmm1 <--- c + d<br></blockquote><br><blockquote type="cite"> vaddsd (%rsp), %xmm0, %xmm0 <--- a + b [independent]<br></blockquote><br><blockquote type="cite"> vaddsd %xmm0, %xmm1, %xmm0 <--- (a + b) + (c + d)<br></blockquote><br><blockquote type="cite"><br></blockquote><br><blockquote type="cite"><br></blockquote><br><blockquote type="cite">What value do you expect in xmm1 in the first vaddsd?<br></blockquote><br><br>As we can see in the test's CHECK line, xmm1 is loaded just ahead of the 'add' in the case where the reassociation is firing:<br>; AVX-NEXT: vmovsd 8(%rsp), %xmm1<br><br>I don't see how register initialization or load folding is relevant in this discussion. There are 3 loads and 3 adds regardless of reassociation, and one operand is already in a register (xmm0) as the result of the final 'call' instruction. But as you noted earlier, the machine combiner doesn't know about loads/spills, so all operands are in virtual registers as far as it knows.<br><br>Perhaps it makes more sense to look at the debug output for the machine combiner pass rather than the final asm?<br></blockquote>It would be great to have an example that actually shows a better assembly sequence. Specifically I would like to see a similar example when there is no call in the block.<br><br><blockquote type="cite">This is the machine trace metric depth info before reassociation occurs:<br><br> Depths for BB#0:<br> 0 Instructions<br> 0 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0 CALL64pcrel32 <ga:@bar>, <regmask ..., %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1 %vreg0<def> = COPY %XMM0; FR64:%vreg0<br> 0 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1 %vreg1<def> = COPY %XMM0; FR64:%vreg1<br> 0 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1 %vreg2<def> = COPY %XMM0; FR64:%vreg2<br> 0 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1 %vreg3<def> = COPY %XMM0; FR64:%vreg3<br> 1 %vreg4<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg4,%vreg0,%vreg1<br> 1 %vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3<br> 4 %vreg6<def> = VADDSDrr %vreg4<kill>, %vreg5<kill>; FR64:%vreg6,%vreg4,%vreg5<br> 7 %XMM0<def> = COPY %vreg6; FR64:%vreg6<br> 7 RETQ %XMM0<br><br>And this is after:<br><br> Depths for BB#0:<br> 0 Instructions<br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg0<def> = COPY %XMM0; FR64:%vreg0<br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg1<def> = COPY %XMM0; FR64:%vreg1<br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg2<def> = COPY %XMM0; FR64:%vreg2<br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 0<span class="Apple-tab-span" style="white-space:pre"> </span>CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg3<def> = COPY %XMM0; FR64:%vreg3<br> 1<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3<br> 4<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg7<def> = VADDSDrr %vreg1, %vreg5<kill>; FR64:%vreg7,%vreg1,%vreg5<br> 7<span class="Apple-tab-span" style="white-space:pre"> </span>%vreg6<def> = VADDSDrr %vreg0, %vreg7<kill>; FR64:%vreg6,%vreg0,%vreg7<br> 10<span class="Apple-tab-span" style="white-space:pre"> </span>%XMM0<def> = COPY %vreg6; FR64:%vreg6<br> 10<span class="Apple-tab-span" style="white-space:pre"> </span>RETQ %XMM0<br><br>Do you agree that this 2nd sequence has been de-optimized?<br></blockquote><br>The crux is that the trace metrics assumes all calls can execute in parallel. That looks like an unrealistic assumption. This suggests there is a deeper modeling problem/question we need to understand first before jumping into solution space.<br><br><blockquote type="cite">As noted earlier, the reason this happens is because the MTM *Height* information for the original sequence looks like this:<br><br> Heights for BB#0:<br> 16 Instructions<br> 3c @ SBPort1 (3 ops x12)<br> 5c @ SBPort5 (5 ops x12)<br> 3c @ SBPort05 (5 ops x6)<br> 4c @ SBPort15 (8 ops x6)<br> 1c @ SBPort23 (1 ops x6)<br> 3c @ SBPort015 (8 ops x4)<br> 2c @ SBPortAny (9 ops x2)<br> 7 0 RETQ %XMM0<br> 7 0 %XMM0<def> = COPY %vreg6; FR64:%vreg6<br> 7 3 %vreg6<def> = VADDSDrr %vreg4<kill>, %vreg5<kill>; FR64:%vreg6,%vreg4,%vreg5<br> 7 6 %vreg5<def> = VADDSDrr %vreg2, %vreg3; FR64:%vreg5,%vreg2,%vreg3<br> 7 6 %vreg4<def> = VADDSDrr %vreg0, %vreg1; FR64:%vreg4,%vreg0,%vreg1<br> 7 6 %vreg3<def> = COPY %XMM0; FR64:%vreg3<br> 7 0 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 7 7 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 8 8 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 8 6 %vreg2<def> = COPY %XMM0; FR64:%vreg2<br> 10 9 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 10 10 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 11 11 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 11 6 %vreg1<def> = COPY %XMM0; FR64:%vreg1<br> 13 12 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 13 13 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 14 14 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 14 6 %vreg0<def> = COPY %XMM0; FR64:%vreg0<br> 16 15 ADJCALLSTACKUP64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> 16 16 CALL64pcrel32 <ga:@bar>, <regmask ...>, %RSP<imp-use>, %RSP<imp-def>, %XMM0<imp-def><br> 17 17 ADJCALLSTACKDOWN64 0, 0, %RSP<imp-def,dead>, %EFLAGS<imp-def,dead>, %RSP<imp-use><br> BB#0 Live-ins: SPL@17<br> Critical path: 17<br><br><br>http://reviews.llvm.org/D13417<br><br><br><br></blockquote><br></body></html>