[PATCH] D15302: [Greedy regalloc] Replace analyzeSiblingValues with something new [Part1]

Thu Mar 24 11:58:29 PDT 2016

On Mon, Mar 21, 2016 at 2:44 PM, Tom Stellard <thomas.stellard at amd.com> wrote:
> tstellarAMD added a comment.
>
> In http://reviews.llvm.org/D15302#379497, @wmi wrote:
>
>> I noticed that even without my change, although compiler output "GCN:
>>  NumVgprs is 256", when I looked at the trace of -debug-only=regalloc,
>>  I found there were some VGPR unused.
>>
>> Here is what I did:
>>  ~/workarea/llvm-r262808/dbuild/./bin/llc -march=amdgcn -mcpu=tahiti
>>  -mattr=+vgpr-spilling -verify-machineinstrs <
>>  ~/workarea/llvm-r262808/src/test/CodeGen/AMDGPU/vgpr-spill-emergency-stack-slot.ll
>>  -debug-only=regalloc >/dev/null 2>out1
>>
>> Delete the trace from out1 before the section of "REGISTER MAP", then
>>  execute the command below:
>>  for ((i=0; i<256; i++)); do
>>
>>   grep "VGPR$i[^0-9]" out1 &>/dev/null
>>   if [[ "$?" != "0" ]]; then
>>     echo VGPR$i
>>   fi
>>
>> done
>>
>> The output is:
>> VGPR40
>> VGPR189
>> VGPR190
>>
>> So even if the compiler says GCN: NumVgprs is 256, there are three
>>  VGPRs never used.
>
>
> NumVgprs is the number of VGPRs that need to be allocated for the program, so the fact that there are gaps doesn't matter (though this is strange).  If you use only register v255, you still need to allocate all 256 registers.
>
>

Hi Tom,

With my patch here, I found that the Spill num for the testcase
increases from 68 to 152, and the Reload num increases from 72 to 188.
I haven't thoroughly understood what is wrong here, but I can roughly
describe how the problem happens, and I think it may be a problem with
local splitting rather than with my patch.

In the testcase, there are roughly 64 VReg_128 vars overlapping with
each other, which together would consume all 256 VGPRs, plus some
other scattered VGPR uses. Each VReg_128 var occupies 4 consecutive
VGPRs, so VGPR registers are allocated in this way:
vreg1: VGPR0_VGPR1_VGPR2_VGPR3; vreg2: VGPR4_VGPR5_VGPR6_VGPR7; ......
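
To make the packed layout concrete, here is a minimal sketch in Python.
It is only a toy model, not LLVM code: tuple_for and tuple_name are
hypothetical helpers, and the assumption is simply that the k-th
VReg_128 starts at VGPR index 4*(k-1).

```python
# Toy model of the packed layout; not LLVM code, names are hypothetical.
NUM_VGPRS = 256
TUPLE_WIDTH = 4  # a VReg_128 spans four consecutive 32-bit VGPRs


def tuple_for(k):
    """VGPR indices of the k-th (1-based) VReg_128 in the packed layout."""
    start = (k - 1) * TUPLE_WIDTH
    return list(range(start, start + TUPLE_WIDTH))


def tuple_name(k):
    """Spell the tuple the way the allocator trace does."""
    return "_".join(f"VGPR{i}" for i in tuple_for(k))


print(tuple_name(1))             # VGPR0_VGPR1_VGPR2_VGPR3
print(tuple_name(2))             # VGPR4_VGPR5_VGPR6_VGPR7
print(NUM_VGPRS // TUPLE_WIDTH)  # 64 tuples fill the whole VGPR file
```

With 64 such tuples the VGPR file is exactly full, which is why any
additional scattered VGPR use forces splitting or spilling.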

Because we have some other scattered VGPR uses, we cannot allocate all
the 64 VReg_128 vars in registers, so splitting is needed. Region
splitting does not cause trouble because it only tries to fill holes,
i.e., vregs created by the splitting usually will not evict other
vregs. Local splitting, however, can make a mess of the allocation
here. Suppose it tries to find a local gap inside a BB to split vreg3
(VReg_128 type). After the local split is done, vreg3 will be split
into vreg3-1 and vreg3-2. vreg3-1 and vreg3-2 have short live ranges,
so both of them have relatively large weights. vreg3-1 may find a hole
and be allocated to VGPR2_VGPR3_VGPR4_VGPR5; then vreg3-2 will get a
hint of VGPR2_VGPR3_VGPR4_VGPR5 and will evict vreg1
(VGPR0_VGPR1_VGPR2_VGPR3) and vreg2 (VGPR4_VGPR5_VGPR6_VGPR7) above.
To find consecutive VGPRs for vreg1 and vreg2, reg alloc will do more
region splitting/local splitting and more evictions, which makes it
harder and harder for the remaining vregs to find consecutive VGPRs.
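
The reason one misaligned hint hits two neighbours can be sketched as a
plain range-overlap check. Again this is a toy model in Python, not how
the greedy allocator actually computes interference; overlaps,
evicted_by and the assigned table are hypothetical names.

```python
# Toy model of tuple interference; not LLVM code, names are hypothetical.
def overlaps(a_start, b_start, width=4):
    """True if two width-wide runs of consecutive VGPRs share a register."""
    return a_start < b_start + width and b_start < a_start + width


def evicted_by(hint_start, assigned, width=4):
    """Names of already-assigned tuples that a tuple at hint_start collides with."""
    return [name for name, start in assigned.items()
            if overlaps(hint_start, start, width)]


# vreg1 at VGPR0..3 and vreg2 at VGPR4..7, as in the description above.
assigned = {"vreg1": 0, "vreg2": 4}

print(evicted_by(2, assigned))  # ['vreg1', 'vreg2']  (hint VGPR2..5)
print(evicted_by(4, assigned))  # ['vreg2']           (aligned hint VGPR4..7)
```

In this model a hint that starts on a multiple of 4 collides with at
most one packed tuple, while the hint at VGPR2 collides with two, so
each such eviction leaves two VReg_128 vars looking for 4 consecutive
free VGPRs.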

With my patch, one more VReg_128 interval is added during splitting
because of hoisting (this is a separate problem that I described in a
TODO about improving hoistCopies in a previous reply). Allocating that
VReg_128 var triggers more region splitting and local splitting, and
causes more vars to be spilled.

To show the problem, I experimentally turned off local splitting. For
trunk without my patch, the Spill num for the testcase drops from 68
to 56, and the Reload num drops from 72 to 36. For trunk with my patch,
turning off local splitting drops the Spill num for the testcase
from 152 to 24, and the Reload num from 188 to 24.

So this is probably a separate issue for architectures that use
consecutive combined registers for large data types.

Thanks,
Wei.