[PATCH] D118020: [RISCV] Set CostPerUse for floating point registers

Fri Feb 18 08:54:50 PST 2022

craig.topper added a comment.

In D118020#3331582 <https://reviews.llvm.org/D118020#3331582>, @pcwang-thead wrote:

> In D118020#3329779 <https://reviews.llvm.org/D118020#3329779>, @asb wrote:
>
>> I'm surprised this resulted in performance increases. I might have guessed that with so few FP instructions being compressible, the further constraint on register selection might be more likely to result in a (slight) decrease in performance. Shows the value of running the benchmarks!
>>
>> I've put this patch on the agenda for the RISC-V LLVM call today, but based on the data so far this seems to make sense.
>
> I am surprised too.
>
> IMO, there is a possible reason that may explain the performance increases:
>
> - When the register number is in [8, 15], instructions can be compressed.
> - Among the first 16 integer registers, x0-x4 (and sometimes x5) are reserved for special uses, and the register allocation order is as follows:
>
>   def GPR : RegisterClass<"RISCV", [XLenVT], 32, (add
>       (sequence "X%u", 10, 17),
>       (sequence "X%u", 5, 7),
>       (sequence "X%u", 28, 31),
>       (sequence "X%u", 8, 9),
>       (sequence "X%u", 18, 27),
>       (sequence "X%u", 0, 4)
>     )> {
>     let RegInfos = XLenRI;
>   }
>
> which means we allocate most of the RVC integer registers first.
> So, for most programs, it makes little difference whether we set `CostPerUse` to `0` or `1`.
>
> - Among the first 16 float registers, none are reserved, and the register allocation order is as follows:
>
>   def FPR32 : RegisterClass<"RISCV", [f32], 32, (add
>       (sequence "F%u_F", 0, 7),
>       (sequence "F%u_F", 10, 17),
>       (sequence "F%u_F", 28, 31),
>       (sequence "F%u_F", 8, 9),
>       (sequence "F%u_F", 18, 27)
>   )>;
>
> which means we allocate the temporary float registers first, so most float instructions can't be compressed.
> So when we set `CostPerUse` to `1`, many more float instructions can be compressed, which reduces icache misses.
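
The allocation-order argument quoted above can be checked with a small sketch. It is only an illustration, not how the register allocator works: the two order tables are copied from the quoted `RegisterClass` definitions, the helper names (`expand`, `is_rvc`, `rvc_in_first`) are made up here, and reserved registers, calling conventions and live ranges are all ignored.

```python
# Allocation orders copied from the quoted GPR and FPR32 RegisterClass
# definitions, written as inclusive (first, last) register-number ranges.
GPR_ORDER = [(10, 17), (5, 7), (28, 31), (8, 9), (18, 27), (0, 4)]
FPR32_ORDER = [(0, 7), (10, 17), (28, 31), (8, 9), (18, 27)]


def expand(order):
    """Flatten inclusive ranges into a list of register numbers, in order."""
    return [n for first, last in order for n in range(first, last + 1)]


def is_rvc(n):
    """True for registers 8-15, the range the quoted text calls compressible."""
    return 8 <= n <= 15


def rvc_in_first(order, k):
    """Count RVC-range registers among the first k in allocation order."""
    return sum(is_rvc(n) for n in expand(order)[:k])


# Hypothetical driver: compare both classes at a few prefix lengths.
for name, order in (("GPR", GPR_ORDER), ("FPR32", FPR32_ORDER)):
    counts = {k: rvc_in_first(order, k) for k in (4, 8, 16)}
    print(name, counts)
```

Under these simplifications, 6 of the first 8 GPRs in allocation order fall in [8, 15], while none of the first 8 FPR32 registers do, which is the asymmetry the quoted explanation relies on.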

What confuses us is that only 8 float opcodes have compressed forms. They are all loads/stores, and 4 of them are limited to RV32.
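
For reference, the 8 opcodes can be listed in a short table. This is a sketch based on the RISC-V "C" extension as I understand it, not something taken from the patch: the mnemonics are the standard ones, but the table layout and the helper names (`available`, `register_sensitive`) are invented for illustration. The `reg3` flag marks the forms whose FP operand uses a 3-bit register field (f8-f15 only); the stack-pointer-relative forms encode a full 5-bit FP register, so register choice should not affect whether they compress.

```python
# (mnemonic, rv32_only, reg3)
#   rv32_only: single-precision compressed forms exist only on RV32
#   reg3:      FP operand is a 3-bit field, so it must be f8-f15
COMPRESSED_FP_OPS = [
    ("c.fld",   False, True),
    ("c.fsd",   False, True),
    ("c.fldsp", False, False),
    ("c.fsdsp", False, False),
    ("c.flw",   True,  True),
    ("c.fsw",   True,  True),
    ("c.flwsp", True,  False),
    ("c.fswsp", True,  False),
]


def available(xlen):
    """Compressed FP opcodes that exist for the given XLEN (32 or 64)."""
    return [m for m, rv32_only, _ in COMPRESSED_FP_OPS
            if xlen == 32 or not rv32_only]


def register_sensitive(xlen):
    """Subset whose compressibility depends on the FP register number."""
    return [m for m, rv32_only, reg3 in COMPRESSED_FP_OPS
            if reg3 and (xlen == 32 or not rv32_only)]


print(available(64))
print(register_sensitive(64))
```

If this reading of the encoding is right, on RV64 only `c.fld` and `c.fsd` can gain anything from steering FP values into f8-f15.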

I ran 453.povray from SPEC2006 on a SiFive Unmatched board with this patch applied to our downstream compiler. My result was a 1% decrease in performance.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D118020