[PATCH] D46494: [DAGCombiner] Masked merge: enhance handling of 'andn' with immediates

Mon May 7 09:26:42 PDT 2018

andreadb added a comment.

In https://reviews.llvm.org/D46494#1089720, @spatel wrote:

> In https://reviews.llvm.org/D46494#1089380, @lebedev.ri wrote:
>
> > Fixed with the correct fold, updated mca diffs in the differential's description:
> >  F6120274: trunk-vs-patch.txt <https://reviews.llvm.org/F6120274> F6120277: prevpatch-vs-patch.txt <https://reviews.llvm.org/F6120277>
>
>
> We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

At the moment, mca doesn't warn you about cross-iteration dependencies.
When micro-benchmarking, it is up to the user to make sure that such dependencies don't skew the analysis. The timeline view makes these situations easier to catch.

> When you simulate these instructions:
> 
>   andnl %edx, %edi, %eax
>   orl $42, %edx
>   andnl %edx, %eax, %eax
>    
> 
> Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:
> 
>   andl	%edx, %edi
>   notl	%edx
>   andl	$42, %edx
>   orl	%edi, %edx    <--- output fed back as the input to first instruction 
>    
> 
> Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.

I agree with Sanjay.

I also noticed that you often test for Ryzen processors. It would also be interesting to see what happens on processors with a smaller issue width.
You can manually change the second ANDN in the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in terms of data dependencies.

With that change, I get that the IPC is almost the same. The ANDNL sequence is slightly better and consumes fewer cycles, mainly because there is one instruction fewer to execute per iteration. Overall, on btver2, IPC goes from 1.33 to 1.50.
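
The dependency argument can be sketched with a toy model. The Python below is not llvm-mca and does not model btver2: it assumes unit latency for every instruction, perfect register renaming, and unlimited issue width, so it only reports the bound imposed by the loop-carried dependency chain. The names `cycles_per_iteration`, `ORIGINAL`, `ANDN` and `ANDN_FEEDBACK` are hypothetical and exist only for illustration.

```python
# Each instruction is modelled as (registers read, registers written).
# Assumptions: 1-cycle latency, register renaming (no WAR/WAW stalls),
# unlimited issue width. This is a sketch, not a scheduler model.
ORIGINAL = [
    ({"edx", "edi"}, {"edi"}),  # andl  %edx, %edi
    ({"edx"}, {"edx"}),         # notl  %edx
    ({"edx"}, {"edx"}),         # andl  $42, %edx
    ({"edi", "edx"}, {"edx"}),  # orl   %edi, %edx
]

ANDN = [
    ({"edx", "edi"}, {"eax"}),  # andnl %edx, %edi, %eax
    ({"edx"}, {"edx"}),         # orl   $42, %edx
    ({"edx", "eax"}, {"eax"}),  # andnl %edx, %eax, %eax
]

# Same as ANDN, but the second andnl writes %edx instead of %eax.
ANDN_FEEDBACK = ANDN[:2] + [
    ({"edx", "eax"}, {"edx"}),  # andnl %edx, %eax, %edx
]


def cycles_per_iteration(seq, iterations=1000):
    """Average cycles per iteration when only data dependencies limit speed."""
    ready = {}  # register -> cycle at which its latest value is available
    last = 0    # cycle at which the latest instruction completes
    for _ in range(iterations):
        for reads, writes in seq:
            done = max((ready.get(r, 0) for r in reads), default=0) + 1
            for w in writes:
                ready[w] = done
            last = max(last, done)
    return last / iterations


if __name__ == "__main__":
    for name, seq in [("original", ORIGINAL),
                      ("andn", ANDN),
                      ("andn, writes %edx", ANDN_FEEDBACK)]:
        cpi = cycles_per_iteration(seq)
        print(f"{name}: {cpi:.3f} cycles/iter, IPC bound {len(seq) / cpi:.2f}")
```

Under these assumptions the original sequence is bound at 3 cycles per iteration (4 instructions, so an IPC of 4/3) and the ANDN variant that writes %edx at 2 cycles per iteration (3 instructions, so an IPC of 3/2). That happens to line up with the btver2 figures quoted above, although a real scheduler model accounts for much more. The unmodified ANDN sequence has a loop-carried chain of only about one cycle per iteration (through the `orl`), which is why its numbers are not directly comparable.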

The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix, which allows it to encode three register operands. That gives the register allocator a bit more flexibility: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding). On instruction encoding: each ANDN is 5 bytes (versus 2 bytes for a register-register AND), so the full sequence grows from a total of 9 bytes to a total of 13 bytes. SimonP and I were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worth generating the original sequence with two-address instructions only. It may be less optimal for data dependencies, but it uses shorter encodings. I'm not sure it matters that much, though.
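
The byte counts can be tallied per instruction. In this sketch only the 2-byte AND, the 5-byte ANDN and the two totals come from the discussion; the other per-instruction sizes are assumptions based on the usual x86 encodings (opcode + ModRM for register-register forms, plus an 8-bit immediate for `$42`). The names `ORIGINAL_SIZES`, `ANDN_SIZES` and `total_bytes` are hypothetical.

```python
# Encoded size in bytes of each instruction (AT&T syntax).
# Sizes marked "assumed" are not stated in the review thread.
ORIGINAL_SIZES = {
    "andl %edx, %edi": 2,         # opcode + ModRM
    "notl %edx": 2,               # assumed: opcode + ModRM
    "andl $42, %edx": 3,          # assumed: opcode + ModRM + imm8
    "orl %edi, %edx": 2,          # assumed: opcode + ModRM
}

ANDN_SIZES = {
    "andnl %edx, %edi, %eax": 5,  # 3-byte VEX prefix + opcode + ModRM
    "orl $42, %edx": 3,           # assumed: opcode + ModRM + imm8
    "andnl %edx, %eax, %eax": 5,  # 3-byte VEX prefix + opcode + ModRM
}


def total_bytes(sizes):
    """Sum the encoded sizes of all instructions in a sequence."""
    return sum(sizes.values())


if __name__ == "__main__":
    before = total_bytes(ORIGINAL_SIZES)
    after = total_bytes(ANDN_SIZES)
    print(f"original: {before} bytes, andn: {after} bytes, delta: {after - before}")
```

With these sizes the totals agree with the 9 and 13 bytes mentioned above: one instruction fewer, but 4 bytes more.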

Repository:
  rL LLVM

https://reviews.llvm.org/D46494