[llvm-dev] [GlobalISel] Legalize generic instructions that also depend on type of scalar, not only scalar size

Fri Sep 28 11:34:39 PDT 2018

> On 24 Sep 2018, at 08:05, Petar Avramovic <Petar.Avramovic at rt-rk.com> wrote:
> 
> Hi,
> 
> On 21.09.2018. 17:50, Daniel Sanders wrote:
>> For G_LOAD/G_STORE it should be ok to use double load/store instructions to load/store i64 values. You just end up with a copy between fprs and gprs if you need to perform an operation that the fpu can't do.
> Since all mentioned instructions for i64 just move bits around, available 64bit fpr instructions could be used. I thought that this approach would result in a few extra move instructions but in order to make it work there seem to be some additional checks and post legalizer legalization. This don't go along with my understanding of GlobalISel pipeline, particularly emulating i64 for mips32.
> I have encountered following questions in my attempt to legalize i64 instructions.
> How do we know if s64 was i64, not double, and we need to copy to gprs?

We don't know if the registers hold i64 or double but we don't need to. The legalizer is only concerned with whether the given code is selectable. RegBankAlloc chooses where the values are held and will insert the copies if necessary.

> How will this copy fold into i32 instructions that emulate i64 instruction?
> Doubles have 64 bit register class (afgr64 or fgr64) while i64 need to be emulated with two gpr32 registers. This copy would require some additional checks in order to be selected into two move instructions, I am not entirely sure where this checks should happen.
> Also what register classes would be used for operands of such copy?

Suppose we have:
	%0(s64) = G_LOAD ...
	%1(s64) = G_LOAD ...
	%2(s64) = G_OR %0, %1
	G_STORE %2(s64), ...
and 's64 = G_OR s64, s64' is the only operation that requires legalization. We need to narrow-scalar it to s32. The legalizer should produce:
	%0(s64) = G_LOAD ...
	%1(s64) = G_LOAD ...
	%3(s32), %4(s32) = G_UNMERGE_VALUES %0(s64)
	%5(s32), %6(s32) = G_UNMERGE_VALUES %1(s64)
	%7(s32) = G_OR %3, %5
	%8(s32) = G_OR %4, %6
	%9(s64) = G_MERGE_VALUES %7(s32), %8(s32)
	G_STORE %9(s64), ...
The G_ADD requires GPR and the G_LOAD/G_STORE requires FPR so RegBankSelect should bind it to register banks like so:
	%0:FPR(s64) = G_LOAD ...
	%1:FPR(s64) = G_LOAD ...
	%3:GPR(s32), %4:GPR(s32) = G_UNMERGE_VALUES %0:FPR(s64)
	%5:GPR(s32), %6:GPR(s32) = G_UNMERGE_VALUES %1:FPR(s64)
	%7:GPR(s32) = G_OR %3, %5
	%8:GPR(s32) = G_OR %4, %6
	%9:FPR(s64) = G_MERGE_VALUES %7:GPR(s32), %8:GPR(s32)
	G_STORE %9:FPR(s64), ...
the copies would be emitted by the ISel for G_UNMERGE_VALUES (using mfhc1/mfc1) and G_MERGE_VALUES (using mthc1/mtc1). On Mips, with a 32-bit FPU this would be selected to:
	%0:AFGR64(s64) = LDC1 ...
	%1:AFGR64(s64) = LDC1 ...
	%10:FGR32(s32) = EXTRACT_SUBREG %0:AFGR64(s64), lo_subreg
	%11:FGR32(s32) = EXTRACT_SUBREG %0:AFGR64(s64), hi_subreg
	%12:FGR32(s32) = EXTRACT_SUBREG %1:AFGR64(s64), lo_subreg
	%13:FGR32(s32) = EXTRACT_SUBREG %1:AFGR64(s64), hi_subreg
	%3:GPR32(s32) = MFC1 = %10:FGR32(s32)
	%4:GPR32(s32) = MFHC1 = %11:FGR32(s32)
	%5:GPR32(s32) = MFC1 = %12:FGR32(s32)
	%6:GPR32(s32) = MFHC1 = %13:FGR32(s32)
	%7:GPR(s32) = OR %3, %5
	%8:GPR(s32) = OR %4, %6
	%14:FGR32(s32) = MTC1 = %7:GPR32(s32)
	%15:FGR32(s32) = MTHC1 = %8:GPR32(s32)
	%9:AFGR64(s64) = REG_SEQUENCE %14, lo_subreg, %15, hi_subreg
	G_STORE %9:FPR(s64), ...

The end result of this is 11 instructions (2x ldc1, 2x mfc1, 2x mfhc1, 2x or, 1x mtc1, 1x mthc1, 1x sdc1) which is actually worse than the alternative 8 instructions (4x lw, 2x or, 2x sw) even when accounting for latency. So how do we produce the better code?

The key is in RegBankSelect. Its job is to bind generic vregs to register banks in a way that is the most efficient for the data flow. It should be aware that GPR <-> AFGR copies cost 2 instructions (3 cycles) and that an s64 load can be done as an either s64 load or 2x s32 loads should weigh up whether it's better to do 64-bit load/store and bind to FPR --> GPR --> FPR or break it into 32-bit load/store and bind to GPR --> GPR --> GPR. This is one of the reasons the bulk of the legalization code is in LegalizeHelper. If RegBankAlloc decides it's better to bind the load to the GPR register bank then it can call LegalizeHelper's methods to narrowScalar the vreg to something compatible with the GPR register bank.

>> Could you elaborate on the issue with G_SELECT, G_EXTRACT, and G_INSERT?
> Issues with G_LOAD, G_STORE, G_SELECT, G_EXTRACT, and G_INSERT are similar and the problem is my approach to legalization of i64 instructions.
> In mips32 td files there are no register classes that can hold i64 nor pseudoinstructions that work with i64. For that reason I consider i64 instructions "not legal".  Because of that my approach was to handle i64 instructions in legalizer, where they are replaced with a sequence of i32 instructions that work with high and low bits (held in s32 virtual registers)  or replaced with a libcall. Also s64 vregs that are "def" merge two s32 vregs (high and low bits) with G_MERGE_VALUES artifact. All "uses" of this s64 register will have to unmerge it and define two s32 vregs with G_UNMERGE_VALUES artifact. After all instructions have been legalized LegalizationArtifactCombiner will combine  G_MERGE_VALUES with G_UNMERGE_VALUES and there will be no more s64 vregs that used to hold i64.
> 
> Here is the llvm-ir input:
> 
> define i64 @f(i64 %a, i64* %b){
> entry:
>   %0 = load i64, i64* %b, align 8
>   %add = add nsw i64 %0, %a
>   ret i64 %add
> }
> 
> After IRTranslator we get this mir as input for legalizer pass
> 
> liveins: $a0, $a1, $a2
> %2:_(s32) = COPY $a0
> %3:_(s32) = COPY $a1
> %0:_(s64) = G_MERGE_VALUES %3(s32), %2(s32)
> %1:_(p0) = COPY $a2
> %4:_(s64) = G_LOAD %1(p0) :: (load 8 from %ir.b)
> %5:_(s64) = G_ADD %4, %0
> %6:_(s32), %7:_(s32) = G_UNMERGE_VALUES %5(s64)
> $v0 = COPY %7(s32)
> $v1 = COPY %6(s32)
> RetRA implicit $v0, implicit $v1
> 
> This is the state of our function after legalization of s64 G_LOAD and G_ADD, we are now going to combine G_MERGE_VALUES with G_UNMERGE_VALUES
> 
> liveins: $a0, $a1, $a2
> %2:_(s32) = COPY $a0
> %3:_(s32) = COPY $a1
> %0:_(s64) = G_MERGE_VALUES %3:_(s32), %2:_(s32)
> %1:_(p0) = COPY $a2
> %16:_(s32) = G_LOAD %1:_(p0) :: (load 8 from %ir.b)

Something strange has happened here since the memory size isn't consistent with the container. I don't think that's the problem though.

I think you should either have :
	%16:_(s64) = G_LOAD %1:_(p0) :: (load 8 from %ir.b)
or:
	%30:_(s32) = G_LOAD %1:_(p0) :: (load 4 from %ir.b)
	%31:_(s32) = G_LOAD %1:_(p0) :: (load 4 from %ir.b + 4)

> %19:_(s32) = G_CONSTANT i32 4
> %18:_(p0) = G_GEP %1:_, %19:_(s32)
> %17:_(s32) = G_LOAD %18:_(p0) :: (load 8 from %ir.b)

Likewise but %ir.b is also incorrect. It should be %ir.b + 4

> %4:_(s64) = G_MERGE_VALUES %17:_(s32), %16:_(s32)
> %9:_(s32), %8:_(s32) = G_UNMERGE_VALUES %0:_(s64)
> %11:_(s32), %10:_(s32) = G_UNMERGE_VALUES %4:_(s64)
> %15:_(s32) = G_ADD %11:_, %9:_
> %12:_(s32) = G_ADD %10:_, %8:_
> %14:_(s32) = G_ICMP intpred(ult), %12:_(s32), %10:_
> %13:_(s32) = G_ADD %15:_, %14:_
> %5:_(s64) = G_MERGE_VALUES %13:_(s32), %12:_(s32)
> %6:_(s32), %7:_(s32) = G_UNMERGE_VALUES %5:_(s64)
> $v0 = COPY %7:_(s32)
> $v1 = COPY %6:_(s32)
> RetRA implicit $v0, implicit $v1
> 
> and this is the legalizer output:
> 
> liveins: $a0, $a1, $a2
> %2:_(s32) = COPY $a0
> %3:_(s32) = COPY $a1
> %1:_(p0) = COPY $a2
> %16:_(s32) = G_LOAD %1(p0) :: (load 8 from %ir.b)
> %19:_(s32) = G_CONSTANT i32 4
> %18:_(p0) = G_GEP %1, %19(s32)
> %17:_(s32) = G_LOAD %18(p0) :: (load 8 from %ir.b)
> %15:_(s32) = G_ADD %17, %3
> %12:_(s32) = G_ADD %16, %2
> %14:_(s32) = G_ICMP intpred(ult), %12(s32), %16
> %13:_(s32) = G_ADD %15, %14
> $v0 = COPY %12(s32)
> $v1 = COPY %13(s32)
> RetRA implicit $v0, implicit $v1
> 
> As we can see there are no s64 vregs that hold integers after legalizer. This approach works as long as we know types of operands (s64 in G_ADD is i64, s64 in G_FADD is double).

I disagree with this sentence, it works even if you don't know the format of the data. Similarly to G_SDIV and G_UDIV not requiring the operands to know their signedness, operations don't require their operands to know their data format. It works because IRTranslator always emits a format-changing operation (G_*EXT/G_*TRUNC/G_FPTOINT/G_INTTOFP/G_BITCAST/etc) so you know that operators always get their expected format.

> With described approach there are issues if we don't know the type of operands and declare generic instruction legal (it should not be legal if we knew the type). Then, we will be missing a G_MERGE_VALUES (G_LOAD and G_EXTRACT) or G_UNMERGE_VALUES (G_STORE and G_INSERT) or both (G_SELECT). After legalizer there will be a few s64 vregs that hold i64 and few  G_MERGE_VALUES or G_UNMERGE_VALUES that were not combined. I would consider this to be legalization error.

It's not a legalization error to have s64 vregs. There are registers that can hold 64-bits and those registers don't care what the bits mean. The only requirement is that it's either fed into an s64-consuming operation or the type is changed (e.g. via G_UNMERGE_VALUES) and fed to operations that can consume that type.

> Perhaps i misunderstood the purpose of legalizer, but why it is possible to determine type of operands for some generic instructions and not possible for the others?

I'm not sure what you mean here so I might be off-track but... LLT's don't really convey the type of the data but rather the requirements on the container. So a p0 vreg has to be capable of holding the bits for a pointer to address space 0, an s32 vreg has to be capable of holding 32 bits, and a v2s64 has to be capable of holding two s64's.  The reason scalars and vectors are distinct is that v2s7 could be s14, s16, s32, etc. depending on how a target prefers to lay out vector registers.

> How could such typeless instruction(that is not legal for all types) fold (in Legalizer) into other instructions that use legalization artifacts to become legal?

I think I've covered near the top of this email. Let me know if I haven't.

> Is there some approach in which we can avoid use of artifacts for i64?

The artifacts are a consequence of a desire to keep the MIR valid after every step of the legalization process. They aren't avoidable without compromising that.

>>> On 21 Sep 2018, at 02:28, Petar Avramovic <Petar.Avramovic at rt-rk.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Mips32 has 64 bit floating point instructions, while i64 instructions have to be emulated with i32 instructions. This means that G_LOAD should be custom legalized for s64 integer value, and be legal for s64 floating point value. There are also other generic instructions with the same problem: G_STORE, G_SELECT, G_EXTRACT, and G_INSERT.
>>> 
>>> There are also other configurations where integer and floating point instructions of the same size are not simultaneously available. This problem was already addressed here http://lists.llvm.org/pipermail/llvm-dev/2017-July/114978.html for G_LOAD/G_STORE.
>>> 
>>> Legality of an instruction should not depend on surrounding instructions. Because of that, approach from regbankselect that iterates through uses of G_LOAD def register and checks if some of the uses was an generic floating point instruction should not be an option for legalizer.
>>> 
>>> Since GlobalISel Legalizer cannot distinguish between them using only LLT, only other option that I can see at this moment is having "F" variant of the generic instruction (like this is the case with G_ADD and G_FADD).
>>> 
>>> Is this change possible? Or are there some other approaches for dealing with this problem?
>>> 
>>> Petar
>>> 
>>> 
> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20180928/8c1769f7/attachment-0001.html>