[PATCH] D103865: [SystemZ] Generate XC loop for memset 0 of variable length.

Mon Jun 28 11:00:03 PDT 2021

uweigand added a comment.

In D103865#2837530 <https://reviews.llvm.org/D103865#2837530>, @jonpa wrote:

> I updated the patch to create the length-1 and trip count on the DAG instead.

Thanks!

> I see that this means that they are initially placed before the zero-length check, while before they were put in the preheader of the loop. The SRLG is actually moved to the preheader by MachineSink later, and this could probably be done for the AGHI as well if the comparison would be made with the AGHI source against 0.
>
> To keep things simple, I just changed the comparison to be against -1 instead for now. What do you think about that - is the zero-length case rare enough to ignore where the AGHI is placed? We could do the extra work of finding the AGHI/-1 and comparing with 0 of that source reg instead, and expect the AGHI will be sunk later if there are no other users I would hope.

I think this is probably OK.   GCC does the compare with 0, but there's probably not much difference.   It's a pity we cannot move that check up to SystemZSelectionDAGInfo::EmitTargetCodeForMemset, but I believe we cannot create new basic blocks on the DAG level.

>> Hmm, that's not quite what I had in mind, sorry. This code still has two completely separate code paths, just now in one function. What I had been thinking of is to actually share the same code path that e.g. emits the loop -- they should be identical except that the trip count is an immediate vs. a register.
>>
>> The point is to merge the two paths to the extent where there is no longer any point in extracting buildMemMemLoop as a subroutine as it actually is used only once.
>
> I gave it another try: First let the reg/immediate cases set up the needed MBBs and then build the loop afterwards for both cases in one place. I guess that seems more simple than having a separate class for creating the loop...

That's better, thanks!   I think readability might be even better if you continue to create the MBBs in order, i.e. emit the EXRL at the end instead of at the beginning.  (That means a duplicated  LengthMO.isReg() test, but that should still be better.)

>> That makes sense. It should be straightforward to sort and de-duplicate the target instructions at final emission time.
>
> AsmPrinter::EmitToStreamer() calls getSubtargetInfo() which needs an MF, so it did not seem quite simple to emit all the EXRL targets in emitEndOfAsmFile(). That method seems to be intended for other things than emitting instructions. Does it seem right to try to create a new section at that point after all else and emit the target instructions?

Ah, that's unfortunate, but I guess understandable.  The assumption is that code emission needs subtarget-specific info, and the subtarget is defined by attributes (like target CPU) that change between functions.  At the end of the asm file, we're outside of all functions ...

There are ways around that.  We do have a TargetMachine in the AsmPrinter, and there used to be a function getSubtargetImpl() that returns a "default" subtarget for the compilation unit.  This has been deprecated (for the above reasons), but could probably be re-activated, and then be used as the subtarget to pass to emitInstruction during emitEndOfAsmFile.

Another option would be to simply "assemble" those instructions by hand, i.e. compute the 6-byte integer value from the constituent fields.  Something like:

  Opcode =  TargetInsOpc << 40 | LenMinus1Reg << 32 | DestReg << 28 | DestDisp << 16 | SrcReg << 12 | SrcDisp;

and then simply emit that value as 6 bytes of "data" (but still into the .text section, of course).    This is also somewhat ugly, but maybe less ugly than the default Subtarget ...

================
Comment at: llvm/lib/Target/SystemZ/SystemZAsmPrinter.cpp:568
+        return;
+      }
+
----------------
Ah, if you do it that way, you don't actually have to have multiple labels for the same instruction.  Just compute the instruction (or just its opcode, see discussion in the main comment) first, and look it up in the table.  If found, simply return its (one) associated symbol; if not, create a new symbol and add the pair of symbol, instruction to the table.

Actually, that table can then just be a std::map (from instruction to symbol) instead of a vector, this implements the lookup in a more efficient way ...

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D103865/new/

https://reviews.llvm.org/D103865