[PATCH] D60441: [X86] Make _Int instructions the preferred instructon for the assembly parser and disassembly parser to remove inconsistencies between VEX and EVEX.

Wed Apr 10 04:59:18 PDT 2019

andreadb accepted this revision.
andreadb added a comment.
This revision is now accepted and ready to land.

In D60441#1460184 <https://reviews.llvm.org/D60441#1460184>, @andreadb wrote:

> In D60441#1460147 <https://reviews.llvm.org/D60441#1460147>, @craig.topper wrote:
>
> > Architecturally that read really does exist. Its not a false dependency. That read defines the upper bits of the result. The fact that AVX and SSE were different before this patch seems like a bug. It looks like with your proposed change they would still be different. That doesn't seem right.
>
>
> Right. Sorry. The upper bits of the result are unmodified for the SSE variants. So yes, it was a bug before, and that read does exist in practice.

So, I found out the reason why the modified BtVer2 test was expecting a higher IPC.
I am specifically talking about the code comment from that test that says ```# Throughput for the SSE code snippets below should tend to 1.00 IPC.```

The microbenchmark which was used to measure the actual throughput on the target was pre-initializing XMM0 to all-zeroes.
It shouldn't have done that because AMD processors (at least since AMDFam15h) implement a "register merge optimization" based on the knowledge of zero bits in XMM registers (see below).

Quoting the AMDFam16h SOG:

  2.11 XMM Register Merge Optimization
  The AMD Family 16h processor implements an XMM register merge optimization.
  The processor keeps track of XMM registers whose upper portions have been cleared to zeros. This information
  can be followed through multiple operations and register destinations until non-zero data is written into a
  register. For certain instructions, this information can be used to bypass the usual result merging for the upper
  parts of the register.

Instruction CVTSI2SS and CVTSI2SD are listed by that document as instructions that can benefit from that register merge optimization.

So... I have rerun my original microbenchmark. This time, I made sure not to set XMM0 to all-zeroes.
This is what I've got:

  cycles:           105415644                                       ( +- 0.20% )
  instructions:     26000303         #   0.25 insn per cycle        ( +- 0.00% )
  micro-opcodes:    51640304         #   0.49 uOps per cycle        ( +- 0.01% )

So, yes. The test should not have expected a 1.00 IPC. It was a bug.
It should have been 0.25 IPC instead (which is what we would get with your patch).

---

I also noticed that this same optimization is done by Fam15h processors (so, it applies to Piledriver). That same paragraph can be found in `AMD Fam15h SOG - Section 5.5 Partial-Register Writes`.

@lebedev.ri can probably verify those numbers for BdVer2.

Again, sorry for the confusion caused by my previous post. I was trying to be useful but I failed... (I keep forgetting about those SSE partial writes).

This patch looks good to me. Hopefully Roman will be able to verify those numbers for BdVer2.

-Andrea

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D60441/new/

https://reviews.llvm.org/D60441