[llvm-bugs] [Bug 44242] New: Folding casts into phi's in InstCombine sometimes creates redundant registers

Fri Dec 6 09:03:58 PST 2019

https://bugs.llvm.org/show_bug.cgi?id=44242

            Bug ID: 44242
           Summary: Folding casts into phi's in InstCombine sometimes
                    creates redundant registers
           Product: new-bugs
           Version: unspecified
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: enhancement
          Priority: P
         Component: new bugs
          Assignee: unassignedbugs at nondot.org
          Reporter: cwabbott0 at gmail.com
                CC: hfinkel at anl.gov, htmldeveloper at gmail.com,
                    jay.foad at gmail.com, llvm-bugs at lists.llvm.org,
                    Matthew.Arsenault at amd.com, tpr.ll at botech.co.uk

InstCombine has a transform that does "cast(phi(a, b)) -> phi(cast(a),
cast(b))", but it doesn't check that the phi has a single use, so if we do
something like this where both the original phi and the cast have a use:

int a1 = ...;
while(true) {
    a2 = phi(a1, a4);
    a3 = bitcast_to_float(a2)
    ...
    a4 = bitcast_to_int(a3 * 0.5);
}

... = a2;
... = a3;

then, after instcombine runs, the original loop phi will stick around, and
after folding casts some more, we'll end up with something like this:

int a1 = ...;
float a5 = bitcast_to_float(a1);
while(true) {
    a2 = phi(a1, a4);
    a6 = phi(a5, a7);
    ...
    a7 = a6 * 0.5;
    a4 = bitcast_to_int(a7);
}

... = a2;
... = a6;

and then codegen will insert some extra unnecessary copies and the register
pressure is bloated because a2 and a6 will each have their own register. On
AMDGPU, integers and floats use the same register file, so the first form is
preferable.

For some context as to why we ran into this: In Mesa, our internal IR's don't
distinguish between floating-point and integer types. Furthermore, even though
we're consuming GLSL and SPIR-V, which do have a way to express this
information, most games don't have it because they're transpiled from D3D
bytecode which doesn't have it, and GPU instruction sets universally have one
register file for integer and floating point operations, so it would be mostly
pointless to use it internally. Thus, when translating to LLVM, we pick one
type (either int or float) and insert bitcasts as necessary around operations
so that the resulting value is always that type.

Recently, inside Mesa's driver for AMD, we've been trying to switch from a
legacy IR called TGSI, which doesn't really understand smaller integer types
and happens to pick float as its preferred type, to a newer IR which does
understand small integer types, and hence using float would be uglier, so the
LLVM emitter for the new IR picks ints as its default type and inserts bitcasts
around every floating point operation. We thought that this choice wouldn't
matter, because LLVM is usually pretty good about cleaning up these casts,
except we ran into this case where it doesn't. In particular this happens
whenever there's a loop-carried floating-point value in the original source,
and we use it both with and without casting it after the loop. The cast after
the loop gets CSE'ed with the cast inside the loop, and you get something like
my example. In the case I'm looking at (a shader from Deus Ex: Mankind Divided)
this roughly doubles register pressure compared to the old IR, which happens to
avoid the problem because of how it uses float as the default type. On AMD
GPU's, excess register pressure can be a problem regardless of whether it
spills or not, due to the register-sharing scheme it uses.

I can see two solutions for this inside LLVM:

1. Disable the transform if the phi has other users.
2. Add a MachineGVN pass which happens inside AMDGPU which gets rid of the
extra phi.

I guess the answer depends on which representation should be canonical, the
first or the second. I think the first is better, because having two phi nodes
for the same value might confuse different analysis passes. (Of course, (1) is
also way less effort). But I'm no expert here. The theoretically best form for
this code is probably something like:

float a1 = ...;
while(true) {
    a2 = phi(a1, a4);
    ...
    a4 = a2 * 0.5;
}

... = bitcast_to_int(a2);
... = a2;

but transforming the first snippet to this requires replacing the non-bitcasted
uses of the original phi with a bitcast of the new phi, which would result in
instcombine fighting with itself. In order to settle on this as the best form,
you'd have to do some sort of global analysis on where to optimally insert the
bitcasts, just local transforms aren't going to cut it. It's these sort of
tricky situations that motivated us to not have separate integer and
floating-point types in our IR in the first place.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-bugs/attachments/20191206/a828273c/attachment.html>