[llvm-dev] Dealing with boolean values in GlobalISel

Mon Sep 30 12:53:45 PDT 2019

> On Sep 30, 2019, at 12:52, Daniel Sanders via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> 
> 
>> On Sep 27, 2019, at 12:47, Arsenault, Matthew via llvm-dev <llvm-dev at lists.llvm.org <mailto:llvm-dev at lists.llvm.org>> wrote:
>> 
>> Hi,
>>  
>> I’ve been thinking about what the strategy to use for boolean values in GlobalISel. There are a few semantic and mechanical issues I’ve encountered.
>>  
>> For background, on AMDGPU, there are two kinds of bool/s1 values. Contextually, a real boolean value will either be a 1-bit scalar condition (in a non-allocatable physical condition register, which will need to be copied to an allocatable class 32-bit vreg during selection), or a vector condition where the s1 will really be a lane mask of results in a 32 or 64-bit SGPR.
> 
> I feel like I'm missing something in the description here as a lane mask in an s1 can't describe more than one lane. Is it the difference between an s1 that's the same value on multiple threads in a wavefront and one that has varying values between the threads?

Yes. The type is logically s1 from an isel perspective. Any boolean operation will read the corresponding thread’s bit from the mask. This doesn’t quite match the register class description, because most of the bits are ignored in the single lane view of the program. The semantics of being an s1 disappears during isel. Machine passes no longer care and can optimize based on the lane mask directly (for example, folds for a wave vote function)

> 
>> This will be known during RegBankSelect. To represent this, I’m using two different pseudo register banks used for operands that really function as booleans (select/br conditions, compare results, and a few intrinsics). These two banks physically alias the SGPR bank, but contextually function differently than an s1 value from a non-boolean source (e.g. a G_LOAD of s1 or G_TRUNC to s1). G_SELECT (s1 G_LOAD) would be illegal for example without inserting some kind of compare between the load and select boolean use.
>>  
>> The primary problem I’m currently aiming to solve is losing this contextual pseudo-register bank information once an instruction is selected, as a virtual register can only have either a register bank or register class. Just considering the register class with a type of s1 is ambiguous for whether it was a scalar or vector condition. Once the use instruction is selected, the operands are constrained discarding the register bank before the def instruction is selected.
> 
> Hmm, if the types differed it would be possible to preserve that into register classes in the same way that MIPS MSA does for the MSA128[BHWD] classes. That only works there because COPY isn't allowed to change the type. If the types are the same too then the register coalescer is going to think the classes are interchangable and you might find the classes changing incorrectly.

One option I briefly considered was to steal some bits in LLT for this, but I’m not sure what that should look like. I could theoretically introduce some kind of wave mask type which compares would legalize to, but then this would probably require a new copy-like legalization artifact to replace compare results which is overly complicated.

One hacky way to solve this would be to introduce a pseudo allocatable boolean register class which is used during selection. We already have a non-allocatable boolean class used for the operand descriptions (which selects between wave32/wave64 depending on the target). I could then post-process the function and replace the register class of all of these. I really don’t like this though. We already use a *different* bool pseudo register class hack like this for SelectionDAG. This also has negative side effects elsewhere in the backend. Occasionally code ends up finding these pseudo classes and attempts to do something with them. We also don’t have a way to express this type of artificial class, so for example we end up computing pressure sets for all of these pseudo classes which wastes scheduler compile time.

> 
>>  For the generated selectors, GIM_CheckRegBankForClass will then reject the def instruction.
> 
> That only comes up if the def is already a selected target instruction doesn't it? Unless something is selecting instructions early (in which case it also wouldn't be able to check the type either) then it should still have a register bank at that point.
No, this is always a problem. constrainSelectedInstRegOperands modifies the register class while selecting the use instruction. GIM_CheckRegBankForClass can find the correct boolean bank from the non-allocatable pseudoclass from the target instruction definition, but can’t unambiguously get there from the register class of the virtual register.

> 
>> It’s most natural to consider the destination register class in the selector, but I can imagine a variety of burdensome workarounds for this problem. Currently the manual selection of G_SELECT inserts an extra copy to an extra virtual register when selecting the boolean as a workaround.
>>  
>> The other question solving this depends on is what should be responsible for ensuring booleaness of operands as necessary. With the pseudoregister bank strategy, a COPY will be inserted which can be implemented as the necessary compare. AArch64 currently uses a different strategy. Compares have an s32 result type, G_SELECT uses an s1 input type, and G_BRCOND allows s1, s8, s16 or s32. The selector inserts an ANDSWri on the condition while handling G_BRCOND and G_SELECT conditions. I would have to insert an extract of the low bit, and a compare which seems unnatural to me in the selector for the use. I also think this would be an easily forgotten detail requiring manual selection code for all boolean sourced intrinsics.
>>  
>> For example, in this function:
>> define i32 @foo(i32 %a, i32 %b, i32 %c) {
>>   %cc = trunc i32 %a to i1
>>   %r = select i1 %cc, i32 %b, i32 %c
>>   ret i32 %r
>> }
>>  
>> SelectionDAG legalizes this to:
>>   t0: ch = EntryToken
>>         t2: i32,ch = CopyFromReg t0, Register:i32 %0
>>       t14: i32 = and t2, Constant:i32<1>
>>       t4: i32,ch = CopyFromReg t0, Register:i32 %1
>>       t6: i32,ch = CopyFromReg t0, Register:i32 %2
>>     t8: i32 = select t14, t4, t6
>>   t10: ch,glue = CopyToReg t0, Register:i32 $w0, t8
>>   t11: ch = AArch64ISD::RET_FLAG t10, Register:i32 $w0, t10:1
>>  
>> Such that the selector doesn’t need to worry about the edge case of a non-boolean s1 value.
>>  
>> Can we make truncates to s1 illegal? The documentation currently states it must be legal for all result types for a legal source type, and is a no-op bit operation to make types match only.
> 
> Could you point me at that bit of the documentation? It's probably out of date.
> 
> https://llvm.org/docs/GlobalISel.html#minimum-rules-for-scalars <https://llvm.org/docs/GlobalISel.html#minimum-rules-for-scalars> says something slightly different:
> * G_TRUNC must be legal for all inputs from the producer type set and all smaller outputs from the consumer type set.
> Which means that it's fine for G_TRUNC to s1 to be illegal if you have nothing consuming s1's. If you have something consuming s1's then it's required so that the legalizer always has a way to narrow a scalar.

This is the line I was referring to. There are s1 consumers. It’s just not acceptable to truncate any value to 1-bit and accept it as a boolean source. I could select any G_TRUNC to s1 as a mask the source and compare with 0. This is not a no-op like operation though.

I think selecting G_TRUNC to s1 as a compare is an OK option, but it’s almost backwards from what AArch64 is doing. However, this would still require a mask to get the low bit. I really don’t like needing to insert a mask during selection here. I think that masking being inserted in boolean use selectors is a defect, and different targets should fundamentally be treating these operations in the same way. It almost seems to me like G_TRUNC doesn’t correspond to the IR trunc instruction, and is instead a different legalization glue operation that doesn’t clear bits. Should these be split into two operations, or should the IRTranslator insert a mask when lowering an IR trunc?

-Matt

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20190930/681ea3ae/attachment-0001.html>