[llvm] [CodeGen] Use 128bits for LaneBitmask. (PR #111157)

Mon Oct 7 07:18:38 PDT 2024

sdesmalen-arm wrote:

> > I'm not really sure what AMDGPU does that is different or how it encodes the information more efficiently. Are there any lane masks in the table I shared above that you believe use unnecessary regunits?
> 
> This table seems to have one bit for every subregister index. I would expect overlapping tuples to use multiple bits of mask. e.g. AMDGPU has this:
> 
> ```
> hi16 :   L0000000000000001 EMPTY
> lo16 :   L0000000000000002 EMPTY
> sub0 :   L0000000000000003 EMPTY
> sub0_sub1 :   L000000000000000F EMPTY
> sub0_sub1_sub2 :   L000000000000003F EMPTY
> ...
> ```

Using the following abbreviations:
* bl = bsub low 8 bits
* bh = bsub high 8 bits
* hh = hsub high 16 bits
* sh = ssub high 32 bits
* dh = dsub high 64 bits
* qh = qsub high 128+ bits

Such that:
* 16-bit subregister 'hsub' <=> `bl | bh`
* 32-bit subregister 'ssub' <=> `bl | bh | hh`
* ..
* 128-bit subregister 'qsub' <=> `bl | bh | hh | sh | dh`
* 128+ bit subregister 'zsub' <=> `bl | bh | hh | sh | dh | qh` (Z registers are scalable vector registers of 128 bits or more)
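
As a minimal sketch of the nesting described above (Python, with hypothetical bit positions; the real lane values are assigned by TableGen, and only the sub-register indices spelled out in the list are included):

```python
# Hypothetical bit positions, lowest lane first. These are assumptions for
# illustration only; TableGen picks the actual values.
LANES = ["bl", "bh", "hh", "sh", "dh", "qh"]
BIT = {name: 1 << i for i, name in enumerate(LANES)}


def mask(*names):
    """OR together the bits of the named lanes."""
    m = 0
    for n in names:
        m |= BIT[n]
    return m


# Each sub-register index covers its own "high" lane plus all narrower lanes.
SUBREG = {
    "hsub": mask("bl", "bh"),
    "ssub": mask("bl", "bh", "hh"),
    "qsub": mask("bl", "bh", "hh", "sh", "dh"),
    "zsub": mask("bl", "bh", "hh", "sh", "dh", "qh"),
}


def is_nested(inner, outer):
    """True if every lane of `inner` is also a lane of `outer`."""
    return SUBREG[inner] & ~SUBREG[outer] == 0


for name, m in SUBREG.items():
    print(f"{name}: {m:#04x}")
```

With this encoding a narrower sub-register is always a subset of a wider one, e.g. `is_nested("hsub", "qsub")` holds, which is the property the six lanes are meant to capture.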

I would expect to need at least 6 regunits (`qh,dh,sh,hh,bh,bl`) to represent all addressable sub-registers of a single 128+ bit register. At the moment, however, TableGen adds new regunits for tuples, so that for the 64-bit D register tuples (DD, DDD, DDDD) it allocates the following regunits:
```
             sh,hh,bh,bl           sh,hh,bh,bl
  sh,hh,bh,bl           sh,hh,bh,bl
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  16                                         0
```

For the 128-bit Q register tuples (QQ, QQQ, QQQQ) it then creates the following (additional) regunits:
```
                dh,sh,hh,bh,bl              dh,sh,hh,bh,bl
  dh,sh,hh,bh,bl              dh,sh,hh,bh,bl
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  20                                                     0
```
And then it would do a similar thing for Z register tuples (ZZ, ZZZ, ZZZZ). In total, this takes up (4 * 4) + (4 * 5) + (4 * 6) = 60 bits, which I agree is unnecessary.
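
The arithmetic can be checked with a short sketch (Python; the names are hypothetical, and the 64-bit budget assumes `LaneBitmask` is currently backed by a 64-bit integer):

```python
# Lanes allocated per tuple element for each register class, as described
# above (D: sh,hh,bh,bl; Q: adds dh; Z: adds qh).
LANES_PER_ELEMENT = {"D": 4, "Q": 5, "Z": 6}
TUPLE_WIDTH = 4            # widest tuples: DDDD, QQQQ, ZZZZ
LANE_BITMASK_BITS = 64     # assumption: current width of LaneBitmask


def total_bits(lanes_per_element, width):
    """Bits used when every class gets its own lanes for every tuple element."""
    return sum(width * n for n in lanes_per_element.values())


current_bits = total_bits(LANES_PER_ELEMENT, TUPLE_WIDTH)
headroom = LANE_BITMASK_BITS - current_bits
print(current_bits, headroom)
```

Under these assumptions the D/Q/Z tuples alone leave only a handful of bits for every other register in the target.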

The way I think we want to represent this is as follows:
```
                          z2,q2,d2,qh,dh,sh,hh,bh,bl                          z0,q0,d0,qh,dh,sh,hh,bh,bl
z3,q3,d3,qh,dh,sh,hh,bh,bl                          z1,q1,d1,qh,dh,sh,hh,bh,bl
```
This would require only 4 x 9 = 36 bits in total and would still allow all tuples to be represented.
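
To make the proposed layout concrete, here is a rough sketch (Python) that assigns one bit per lane in the order drawn above. The bit positions and the helper names are assumptions for illustration, and `d<i>`, `q<i>`, `z<i>` are treated simply as three extra lane names per tuple element:

```python
SHARED = ["bl", "bh", "hh", "sh", "dh", "qh"]  # lowest bit first
ELEMENTS = 4                                   # tuple elements 0..3


def element_lanes(i):
    """Lane names for tuple element i, lowest bit first."""
    return SHARED + [f"d{i}", f"q{i}", f"z{i}"]


# Map (element index, lane name) -> bit position; element 0 sits in the low bits.
LAYOUT = {}
for i in range(ELEMENTS):
    lanes = element_lanes(i)
    for j, name in enumerate(lanes):
        LAYOUT[(i, name)] = i * len(lanes) + j

proposed_bits = len(LAYOUT)
saved_bits = 60 - proposed_bits  # 60 is the current total computed earlier
print(proposed_bits, saved_bits)
```

Each tuple element gets the same nine lanes, so the total grows with the tuple width only, not with the number of register classes that form tuples.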

That said, I'm at my wits' end as to how to represent this in TableGen, because at every turn I run into some TableGen assertion failure. I've tried using `ComposedSubRegIndex`, to no avail. I'm not sure whether I'm running into TableGen bugs or whether I'm just not describing this the right way using the existing constructs.

Could you give me some suggestions on what the right way is to represent it?

I do wonder whether extending the number of bits for `LaneBitmask` is such a big problem in practice. I suspect we'll need a wider bitmask at some point anyway, given that all registers for a target share the same encoding space in `LaneBitmask`. As was pointed out, for AMDGPU it's already at the limit.

https://github.com/llvm/llvm-project/pull/111157