[PATCH] D104853: [X86] Add description of FXAM instruction

Thu Jun 24 12:19:05 PDT 2021

sepavloff added a comment.

In D104853#2838990 <https://reviews.llvm.org/D104853#2838990>, @craig.topper wrote:

> In D104853#2838943 <https://reviews.llvm.org/D104853#2838943>, @sepavloff wrote:
>
>> In D104853#2838797 <https://reviews.llvm.org/D104853#2838797>, @craig.topper wrote:
>>
>>> FXAM appears to be two uops where FTST is one on modern Intel CPUs based on Agner Fog's data. Agner's data for some AMD CPUs shows ~20 cycles of latency.
>>
>> Could tuning scheduling for this instruction be subsequent work?
>
> Yes, but that 20 cycle latency on some AMD CPUs is a little concerning. X87 tends to get more and more unoptimized in modern CPUs and I'm sure what instructions show up in code factors in to those design decisions. So using an X87 instruction that compilers haven't historically used could expose unexpected performance issues.

Here is a table of FXAM properties built from Agner Fog's data:

| Core            | N uops | Latency | Rec. Throughput |
|-----------------|--------|---------|-----------------|
| K7              | 2      |         | 2               |
| K8              | 2      |         | 1               |
| K10             | 2      |         | 1               |
| Bulldozer       | 1      | 20      | 0.5             |
| Piledriver      | 1      | 20      | 0.5             |
| Steamroller     | 1      | 26      | 0.5             |
| Excavator       | 1      | 26      | 0.5             |
| Zen1            | 1      |         | 1               |
| Zen2            | 1      |         | 1               |
| Zen3            | 1      |         | 0.5             |
| Bobcat          | 2      |         | 2               |
| Jaguar          | 2      |         | 2               |
| Pentium         |        | 17-21   |                 |
| Pentium 2,3     | 1      | 2       |                 |
| Pentium M       | 1      |         | 1               |
| Merom           | 1      |         | 1               |
| Wolfdale        | 1      |         | 1               |
| Nehalem         | 1      |         | 1               |
| Sandy Bridge    | 2      |         | 2               |
| Ivy Bridge      | 2      |         | 2               |
| Haswell         | 2      |         | 2               |
| Broadwell       | 2      | 6       | 2               |
| Skylake         | 2      | 6       | 2               |
| SkylakeX        | 2      | 6       | 2               |
| Coffee Lake     | 2      | 6       | 2               |
| Ice Lake        | 2      | 6       | 2               |
| Pentium 4       | 1      | 2       | 1               |
| Prescott        | 1      |         | 1               |
| Atom            | 1      | 1       | 1               |
| Silvermont      | 1      | 7       | 1               |
| Goldmont        | 1      |         | 1               |
| Goldmont+       | 1      |         | 1               |
| Knights Landing | 1      |         | 1               |
| VIA Nano 2000   |        |         | 41              |
| Nano 3000       | 15     | 38      | 38              |

The number of uops is 1 or 2 on almost all cores; the only outlier in the table is VIA Nano 3000 with 15.

For operations on fp80, this instruction is still better than emulation, which is the main motivation for using it. It can probably also be useful when SSE is unavailable.
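
For reference, FXAM reports the class of ST(0) in the condition code bits C3, C2 and C0 of the x87 status word and the sign in C1. Below is a minimal Python sketch of decoding a status word (as stored by FNSTSW) into that class. It is only an illustration, not the code from this patch: the names `decode_fxam` and `FXAM_CLASSES` are hypothetical, and the "reserved" label for the remaining bit combination is my own assumption.

```python
# Bit positions of the condition codes in the x87 FPU status word.
C0 = 1 << 8
C1 = 1 << 9   # FXAM stores the sign of ST(0) here
C2 = 1 << 10
C3 = 1 << 14

# (C3, C2, C0) -> class reported by FXAM.
FXAM_CLASSES = {
    (0, 0, 0): "unsupported",
    (0, 0, 1): "nan",
    (0, 1, 0): "normal",
    (0, 1, 1): "infinity",
    (1, 0, 0): "zero",
    (1, 0, 1): "empty",
    (1, 1, 0): "denormal",
}


def decode_fxam(status_word):
    """Return (class_name, is_negative) for a status word written after FXAM.

    Hypothetical helper for illustration; the (1, 1, 1) combination is not
    assigned a class, so it is labelled "reserved" here as an assumption.
    """
    key = (
        int(bool(status_word & C3)),
        int(bool(status_word & C2)),
        int(bool(status_word & C0)),
    )
    is_negative = bool(status_word & C1)
    return FXAM_CLASSES.get(key, "reserved"), is_negative
```

A single FXAM followed by FNSTSW thus gives the full classification, whereas emulation has to inspect the exponent, the explicit integer bit and the fraction of the 80-bit value separately.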

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D104853/new/

https://reviews.llvm.org/D104853