[PATCH] D73446: AMDGPU/GlobalISel: Legalize a16 images

Sun Feb 2 12:04:52 PST 2020

arsenm added a comment.

In D73446#1853703 <https://reviews.llvm.org/D73446#1853703>, @nhaehnle wrote:

> >>>> I see your point about combiner passes -- so could we perhaps just select the image instructions early instead? It's not like we have very interesting things happening for image instructions in the ISel patterns anyway, and one of the benefits of the GlobalISel infrastructure is that it's supposed to be flexible enough for stuff like that...
> >>> 
> >>> I think the only reasonable place outside of selection to do this would be in RegBankSelect/applyMappingImpl, which I did consider before going with the legalizer. We still need to RegBankSelect the intrinsics, and possibly move them into a waterfall loop, and I think introducing real register class constraints earlier is generally undesirable. Eventually, we should run some combiner pass after which should take care of packing code. We could theoretically trick RegBankSelect into handling the selected instructions, but that would also be pretty disgusting. Another small issue with this is it sort of implies making the full set of legalization artifacts legal for <3 x f16> cases. I started working towards this in D72639 <https://reviews.llvm.org/D72639>, but I don't really like it and would prefer if these were all legalized to <4 x s16> during legalization. I think we could use the artifact combiner to eliminate the illegal register types in RegBankSelect, but I don't think that should be mandatory. We claim some of these these are legal today, but that's mostly a hack to deal with missing features in the legalizer.
> >> 
> >> Part of the problem here is that the final machine instructions have register class constraints in the first place. I've been wondering for some time now what register classes even buy us in the end. It seems to me that they're largely useless, and almost everything we need from a conceptual point of view is contained in the register banks. The few complications around M0, SCC, VCC, could be dealt with explicitly since we largely shouldn't allocate them using a generic approach anyway.
> > 
> > The register classes aren't really useless, and we do have a variety of more exotic operand constraints to deal with. We need them to represent cases like operands that don't support M0/exec/whatever, and cases like SReg_96 only supporting one 64-bit subregister. VCC is also a completely normal, allocatable register. The class constraints don't even necessarily matter when allocating, but when folding copies between classes.
>
> Conversely, the straightjacket of register classes causes a lot of pain around image instructions and indirect register indexing. They also complicate every single instance of checking whether a register is SGPR or VGPR, which is something that we do quite a lot.
>
> I don't think it's entirely honest to pretend that VCC is a completely normal, allocatable register. VCC use affects code size, which should be taken into account when allocating it. It's also special due to the interaction with VCCZ. Finally, IIRC the scoreboard in gfx10 treats VCC specially, which may also have implications (I haven't fully thought those through though).

These aren't really constraints, and are merely optimization hints. Trying to treat VCC as different (or only as a physical register) can only penalize code in all of these situations. VCCZ is effectively an alias, and we don't try to make use of it currently. None of these issues impact instruction selection.

> The SReg_96 comment is interesting. Where do we end up with SReg_96 in the first place after legalization? I can only think of indirect indexing. Also, how is it different from SReg_128 not allowing you to take sub1_sub2 as a subregister?

Copies involving physical registers, and also inline asm. It's a question of composing subregisters. If you want a sub0_sub1 of an SReg_96, you may have to copy to get a properly aligned register pair. The way we do calling convention lowering today happens to avoid this in normal cases, but I don't want to rely on this kind of behavior. We'll currently get SReg_96 from 96-bit phis, but we could also start legalizing these into 32-bit pieces. The register class constraints aren't directly relevant to this specific problem, as the main reason I want to defer selection from here in the first place is we don't even have the register bank yet. I could start directly selecting in RegBankSelect, but I don't think that's optimal either.

> 
> 
>>> Is there a way to just relax those constraints for a select subset of opcodes, like the image opcodes?
>> 
>> I don't know what this really would mean
> 
> The thing under discussion here from my perspective is that it's awkward to overload the semantics of image intrinsics in the way that this and related changes are doing, and the question was why we can't just directly go to the final image instructions. One aspect of this is that you'd have a non-generic machine instruction refering to register that don't have a register class, for a couple of passes at least. That doesn't seem too crazy to me.

I'm leaning towards inventing what is essentially a custom G_INTRINSIC type to track the legalization of the awkward cases. The important information will still be tracked by preserving the intrinsic ID operand, but the operands will be changed as here. I think this only requires a small number of wrapper operations (I think 1, but maybe 4 at most). The current intermediate DAG nodes seem to get away with just _d16 variants for dealing with the annoying unpacked register layout case.

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D73446/new/

https://reviews.llvm.org/D73446