[llvm-dev] top-down vs. bottom-up list scheduling

Tue Nov 6 11:19:16 PST 2018

On 11/6/2018 6:19 AM, Sjoerd Meijer via llvm-dev wrote:
> Hello List!
>
> I am looking at top-down vs. bottom-up list scheduling for simple(r) in-order
> cores. First, for some context, below is a fairly representative pseudo-code
> example of the sort of DSP-like codes I am looking at:
>
>    uint64_t foo(int *pA, int *pB, unsigned N, unsigned C) {
>      uint64_t sum = 0;
>      while (N-- > 0) {
>        A1 = *pA++;
>        A2 = *pA++;
>        B1 = *pB++;
>        B2 = *pB++;
>        ...
>        sum += ((uint64_t) A1 * B1) >> C;
>        sum += ((uint64_t) A2 * B2) >> C;
>        ...
>      }
>      return sum;
>    }
>
> These kernels are very tight loops. In this sort of legacy codes it's not
> uncommon that loops are manually tuned and unrolled with the available
> registers in mind, so that all values just about fit in registers. In the
> example above, we read from input streams A and B, and do 2 multiply-accumulate
> operations.  The loop is unrolled 2 times in this example, but 4, 8 or 16 would
> more realistic, resulting in very high register pressure.
>
> But legacy apps written in this way are not the only culprit; we see that with
> (aggressive) loop unrolling we can end up in exactly the same situation: the
> loopunroller does exactly what it needs to do, but it results in very high
> register pressure.
>
> Here's the (obvious) problem: all live values in these loops should just about
> fit in registers, but suboptimal/wrong decisions are made resulting in a lot of
> spills/reloads; the machine scheduler is making the life of the register allocator
> very difficult. I am looking at regressions of about 30 - 40% for more
> than a handful of kernels, and am thus very interested in what I could do about
> this.
>
> The first observation is that it looks like we default to bottom-up scheduling.
> Starting bottom-up, all these "sum" variables are scheduled first, and after
> that the loads (this is simplifying things a bit). And thus it looks like we
> create a lot of very long live-ranges, causing the problems mentioned above.
> When instead we start top-down, the loads are picked up first, then the MAC, and
> this repeated for all MACs. The result is a sequence LD, MAC, LD, MAC, etc.,
> which is let's say a sequence more in program-order, also with shorter
> live-ranges. This does exactly what I want for these sort of kernels, it generates
> exactly the code we want.

Do you know why the existing register pressure heuristics don't work for 
your testcase with the bottom-up scheduler?

-Eli

-- 
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project