[PATCH] D52997: [x86] allow single source horizontal op matching (PR39195)

Tue Oct 9 03:23:10 PDT 2018

andreadb added a comment.

Hi Sanjay,

On Jaguar only, those unary HADDs are going to be as fast as the SHUFFLE+ADD sequence.
In terms of overall latency, both sequences are pretty much equivalent.

HADD has a worse throughput than the SHUFFLE+ADD sequence (1 IPC) mainly because it can only execute on pipe0. SHUFFLE+ADD gives more flexibility to the HW scheduler.
The biggest advantage on Jaguar is that HADD is not microcoded. The XMM variant is fast-path single, which allows us to achieve a better throughput from the decoders (w.r.t. the SHUFFLE+ADD).
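
For reference, here is a minimal scalar sketch of the two sequences being compared. It only models lane semantics, not timing. The type V4 and the helper names are hypothetical, and the single-shuffle form (a MOVSHDUP-style odd-lane duplicate followed by a vertical add) is an assumption about what the SHUFFLE+ADD sequence looks like when only lane 0 is demanded; it is not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical 4 x float vector type used only for this illustration.
using V4 = std::array<float, 4>;

// Scalar model of HADDPS: pairwise sums of the first operand fill the low
// two lanes, pairwise sums of the second operand fill the high two lanes.
V4 haddps(const V4 &a, const V4 &b) {
  return {a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3]};
}

// Unary (single source) form: both operands are the same register.
V4 unaryHadd(const V4 &a) { return haddps(a, a); }

// Assumed alternative: duplicate the odd lanes (MOVSHDUP-style shuffle),
// then do a vertical add. Only lanes 0 and 2 hold pairwise sums.
V4 shuffleAdd(const V4 &a) {
  const V4 odd = {a[1], a[1], a[3], a[3]};
  V4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = a[i] + odd[i];
  return r;
}

// For finite inputs, lane 0 of both sequences is a[0] + a[1].
void checkLane0(const V4 &a) { assert(unaryHadd(a)[0] == shuffleAdd(a)[0]); }
```

Both forms compute the same demanded lane; the difference discussed above is purely in how many uops they decode to and which pipes those uops can execute on.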

I don't see it as a big problem if we start "regressing" this particular case on Jaguar.

I don't have a problem with aggressively selecting HADD at ISel stage, provided that we "undo" that canonicalization in a later (machine combiner?) pass, similarly to what we do for other instructions (CMOV/LEA) which may be further expanded later on.
Using HADD is not just slow on Intel; it is going to be slow on other AMD processors too.

The problem with having a rule in the machine combiner is that we need to account for register pressure and block frequency too. Essentially, we need a (not too trivial) cost model there; simply comparing code snippets in terms of throughput and latency is probably not enough at that stage.
We could have a post-RA pass (before we run the post-RA scheduler) that decides when it is profitable to revert the HADD canonicalization and expand it back to a shuffle+add.

Just my opinion.

https://reviews.llvm.org/D52997