[PATCH] D129715: [LoongArch] Heuristically load FP immediates by movgr2fr from materialized integer

Fri Jul 15 03:26:32 PDT 2022

xen0n added a comment.

In D129715#3654388 <https://reviews.llvm.org/D129715#3654388>, @gonglingqin wrote:

> In D129715#3653907 <https://reviews.llvm.org/D129715#3653907>, @xen0n wrote:
>
>> In D129715#3653900 <https://reviews.llvm.org/D129715#3653900>, @gonglingqin wrote:
>>
> >>> I tested on a 3A5000 with LLVM 13, limiting integer materialization to 1, 2, and 4 instructions. The results show that performance is best when using no more than 2 instructions. Maybe we should also test the case where the integer is materialized within 3 instructions.
>>
> >> It would be better to find some time to upgrade your benchmarking environment so you can test the actual main branch. ;-)
>>
> >> Regarding the actual benchmarks, yes, I think testing the 3-instruction case could be useful. But again, it may not make a significant difference: the sign bit and the 11-bit IEEE-754 biased exponent occupy the highest 12 bits, so every f64 with its top 12 bits zeroed is either zero or a denormal. And numbers whose binary representations have big "holes" of all 0s or all 1s in the two "middle" 20-bit segments or the lowest 12 bits are probably not common in the wild, let alone as immediates. You could of course try benchmarking, but I doubt the result would be much different from the 2-insn case.
>>
>> (The 4-insn case is useless and equivalent to unconditionally loading via integer immediates, because all 64-bit values can be loaded in 4 insns (`lu12i.w + ori + lu32i.d + lu52i.d`) in LA64, and in LA32 you need two pairs of materialization and GPR-FPR moves for the higher and lower 32 bits anyway.)
>
> > The test results show that materializing the integer within 3 instructions performs better than the 2-instruction case. The results are shown in the table below:
>
> | Benchmarks  | Score of 2 instructions case | Score of 3 instructions case | diff |
> | 433.milc    | 13.2                         | 13.2                         | 0    |
> | 444.namd    | 15                           | 15.1                         | 0.1  |
> | 447.dealII  | 26.6                         | 26.7                         | 0.1  |
> | 450.soplex  | 23.6                         | 24.2                         | 0.6  |
> | 453.povray  | 23.3                         | 23.4                         | 0.1  |
> | 470.lbm     | 21.5                         | 21.9                         | 0.4  |
> | 482.sphinx3 | 25.5                         | 25.5                         | 0    |
>
> > It seems that the 3-instruction case outperforms the other cases. @xen0n, do you have any suggestions?
> > (Since we do not support flang for the time being, I didn't test the Fortran benchmarks.)

This is interesting data. Are the SPEC2006 runs one-shot, or averaged over multiple runs like the Phoronix Test Suite does? That said, the 450.soplex difference does look large enough to be statistically significant.
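
To put the differences in the quoted table in relative terms, here is a quick sketch. The scores are copied from that table; the helper name `relative_gain_percent` is made up for illustration. Note that this says nothing about run-to-run noise, which only repeated runs can show.

```python
# Scores copied from the quoted table: (2-instruction case, 3-instruction case).
SCORES = {
    "433.milc": (13.2, 13.2),
    "444.namd": (15.0, 15.1),
    "447.dealII": (26.6, 26.7),
    "450.soplex": (23.6, 24.2),
    "453.povray": (23.3, 23.4),
    "470.lbm": (21.5, 21.9),
    "482.sphinx3": (25.5, 25.5),
}


def relative_gain_percent(two_insn: float, three_insn: float) -> float:
    """Relative change of the 3-insn score over the 2-insn score, in percent."""
    return (three_insn - two_insn) / two_insn * 100.0


for name, (two_insn, three_insn) in SCORES.items():
    gain = relative_gain_percent(two_insn, three_insn)
    print(f"{name:12s} {gain:+.2f}%")
```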

I think some assembly comparison could go a long way. But again, SPEC2006 is *horribly outdated*, so IMO the argument for a 3-instruction threshold would be a lot stronger if you could replicate this result on a more recent or more comprehensive benchmark suite. (PTS and newer SPEC releases are both better than SPEC2006 in this regard.)
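
Short of a full assembly comparison, a rough way to see which FP constants fall under each threshold is to count instructions for their bit patterns. The sketch below is a simplified model of the `lu12i.w + ori + lu32i.d + lu52i.d` sequence on LA64, not the code in this patch or LLVM's actual materialization logic. The function names are hypothetical, and treating 0.0 as free (by moving from `$zero`) is an assumption.

```python
import struct


def f64_bits(x: float) -> int:
    """Return the IEEE-754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def estimate_insn_count(val: int) -> int:
    """Estimate the LA64 instruction count to materialize the 64-bit pattern val.

    Simplified model for illustration only; a real compiler may do better.
    """
    if val == 0:
        return 0  # assumption: $zero can be moved to the FPR directly

    highest12 = (val >> 52) & 0xFFF   # bits 63:52, set by lu52i.d
    higher20 = (val >> 32) & 0xFFFFF  # bits 51:32, set by lu32i.d
    hi20 = (val >> 12) & 0xFFFFF      # bits 31:12, set by lu12i.w
    lo12 = val & 0xFFF                # bits 11:0, set by ori or addi.w

    # A lone lu52i.d on $zero suffices when only the top 12 bits are set.
    if (val & ((1 << 52) - 1)) == 0:
        return 1

    # Low 32 bits: one instruction (ori or addi.w) if they fit a 12-bit
    # immediate, otherwise lu12i.w plus an optional ori.
    if hi20 == 0:
        count = 1
    elif hi20 == 0xFFFFF and lo12 >> 11 == 1:
        count = 1
    else:
        count = 1 + (1 if lo12 else 0)

    # Bits 51:32 now replicate bit 31; lu32i.d is needed if that is wrong.
    if higher20 != (0xFFFFF if hi20 >> 19 else 0):
        count += 1

    # Bits 63:52 now replicate bit 51; lu52i.d is needed if that is wrong.
    if highest12 != (0xFFF if higher20 >> 19 else 0):
        count += 1

    return count


for x in (1.0, 0.5, 1.5, 10.0, 0.1):
    bits = f64_bits(x)
    print(f"{x!r:>5} 0x{bits:016X} -> {estimate_insn_count(bits)} insn(s)")
```

Running this over the FP immediates that actually appear in a benchmark's assembly would show how many of them a 3-instruction threshold covers beyond the 2-instruction one.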

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D129715/new/

https://reviews.llvm.org/D129715