[llvm] r298368 - [AMDGPU] Iterative scheduling infrastructure + minimal registry scheduler

Pykhtin, Valery via llvm-commits llvm-commits at lists.llvm.org
Tue Mar 21 10:10:38 PDT 2017


There isn't much functionality required for this. At minimum you need a GCNIterativeScheduler::scheduleRegion function that implements a tentative schedule in IR. Everything else is quite dependent on the target's needs. However, in the best scenario I would "strip" the code that creates DAG SUnits out of the current class hierarchies to make the scheduling DAG a standalone structure with a well-defined interface that can be used for anything. I like the idea of being able to annotate DAG SUnits with the values required for a particular scheduler while keeping the DAG structure as minimal as possible. This not only makes it simpler but also allows DAGs to be created, saved, and reused.
 

-----Original Message-----
From: Hal Finkel [mailto:hfinkel at anl.gov] 
Sent: Tuesday, March 21, 2017 7:26 PM
To: Pykhtin, Valery; llvm-commits at lists.llvm.org
Subject: Re: [llvm] r298368 - [AMDGPU] Iterative scheduling infrastructure + minimal registry scheduler


On 03/21/2017 10:13 AM, Pykhtin, Valery wrote:
>> Could you elaborate on what "iterative" means here? You talk about using lightweight schedules so that you can rank and compare multiple schedules. Is this being done in the "current iteration schedule vs.
>> next iteration schedule" or in some more general sense?
> Well, I meant the more general sense. The idea is to be able to generate different schedules, score and compare them, and implement the best one in IR. How this is done in general isn't prescribed, but this change already has one example: GCN architecture performance is very sensitive to the number of registers used, so this is one of the most important parameters to compare between schedules. Look at GCNIterativeScheduler::scheduleLegacyMaxOccupancy; in brief, it works as follows:
>
> 1. All regions across the function (kernel) are recorded (bounds) and sorted by register pressure - most demanding at front. At this point, since we know the max number of registers used, we can determine the maximum number of waves a GPU can run on a single SIMD.
> 2. tryMaximizeOccupancy pass: run over a sequence of the most demanding regions with the minimal register scheduler, in the hope that it reduces the max number of registers. If it succeeds, record the schedules to be implemented later.
> 3. Legacy scheduler pass (based on the Generic Scheduler). At this point we know the best achievable max register usage and can pass that target to the legacy strategy - run it over the recorded regions again. If the legacy strategy succeeds in fitting the target, use the resulting schedule; otherwise use either the schedule from step 2 or the original schedule for the region.
>
> The testcase in the change is actually scheduled with the minreg scheduler during step 2.
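The three-step flow described above can be sketched with hypothetical, simplified types. Region, the occupancy thresholds, and the minreg stand-in below are all made up for illustration; none of this is the real LLVM API:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical, simplified model of the three-step flow described above.
struct Region {
  int OrigPressure; // register pressure of the original schedule
  int BestPressure; // best pressure found so far
};

// Occupancy (waves per SIMD) as a decreasing step function of register use;
// the thresholds here are invented for illustration.
int occupancy(int Pressure) {
  if (Pressure <= 24) return 10;
  if (Pressure <= 32) return 8;
  if (Pressure <= 48) return 5;
  return 4;
}

// Step 1: sort regions so the most register-demanding comes first.
// Steps 2-3: try a min-register schedule on each region; keep it only if it
// lowers pressure, otherwise fall back to the original schedule.
int scheduleForOccupancy(std::vector<Region> &Regions) {
  std::sort(Regions.begin(), Regions.end(),
            [](const Region &A, const Region &B) {
              return A.OrigPressure > B.OrigPressure;
            });
  for (Region &R : Regions) {
    int MinRegResult = R.OrigPressure - 8; // stand-in for the minreg strategy
    R.BestPressure = std::min(R.OrigPressure, std::max(MinRegResult, 1));
  }
  int Max = 0;
  for (const Region &R : Regions)
    Max = std::max(Max, R.BestPressure);
  return occupancy(Max);
}
```

The key property mirrored here is that occupancy is determined by the single worst region, which is why sorting the most demanding region to the front and attacking it first pays off.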

Interesting, thanks! I've long wondered if, for short scheduling regions, we should just try all valid permutations and pick the best one, and this framework seems like a step in enabling that.

Can you comment on what would need to be updated/changed/enhanced in order to make the iterative infrastructure available to other targets?

  -Hal

>
>
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Tuesday, March 21, 2017 5:50 PM
> To: Pykhtin, Valery; llvm-commits at lists.llvm.org
> Subject: Re: [llvm] r298368 - [AMDGPU] Iterative scheduling 
> infrastructure + minimal registry scheduler
>
>
> On 03/21/2017 09:19 AM, Pykhtin, Valery wrote:
>> Hi Hal,
>>
>> Thank you for pointing this out. I thought a reference to the review was enough. I've copy/pasted the overview from the review here, as it is up to date.
> Thanks! This makes it much easier to find things when searching e-mail, git logs, etc.
>
> Could you elaborate on what "iterative" means here? You talk about using lightweight schedules so that you can rank and compare multiple schedules. Is this being done in the "current iteration schedule vs.
> next iteration schedule" or in some more general sense?
>
> Also, do the current strategies iterate at all? I'm trying to get a better feel for how the iterative process will actually work (i.e. why does iteration change the answer).
>
>    -Hal
>
>> An iterative approach to finding the best schedule is essential for the GCN architecture. This change combines a number of ideas for iterative scheduling and presents the infrastructure.
>>
>> Lightweight scheduling
>>
>> Default schedulers schedule immediately on MIR, reordering instructions and updating IR data such as LiveIntervals. This is relatively heavy - instead, a scheduling strategy can return an array of MachineInstr pointers (or equivalent, as SIScheduler does) that defines a particular schedule. This lightweight schedule can be scored against other variants and implemented once. There are two types of lightweight schedules:
>>
>> 1. An array of pointers to DAG SUnits - supposed to be returned by strategies. The benefit here is that a scoring function can use DAG SUnits. Doesn't include debug values.
>> 2. An array of pointers to MachineInstr - this is a so-called 'detached' schedule in the sense that it doesn't depend on DAG state anymore, and it includes debug values. This is useful when there is a need to store some variants for later selection.
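The point of a lightweight schedule is that scoring candidate orderings touches no IR; only the winner is implemented. A minimal sketch with stand-in types (SUnit here is a toy with an invented RegDelta field, not the real llvm::SUnit, and peak pressure is a deliberately naive score):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Illustrative stand-in; not the real LLVM type.
struct SUnit {
  unsigned NodeNum;
  int RegDelta; // invented: net change in live registers at this node
};

// A lightweight schedule is just an ordering of SUnit pointers; scoring it
// (here: peak register pressure) requires no IR mutation.
int peakPressure(const std::vector<const SUnit *> &Sched) {
  int Cur = 0, Peak = 0;
  for (const SUnit *SU : Sched) {
    Cur += SU->RegDelta;
    Peak = std::max(Peak, Cur);
  }
  return Peak;
}

// Compare several candidate orderings and return the index of the best one;
// only that winner would then be implemented on MIR.
size_t pickBest(const std::vector<std::vector<const SUnit *>> &Candidates) {
  size_t Best = 0;
  int BestScore = std::numeric_limits<int>::max();
  for (size_t I = 0; I < Candidates.size(); ++I) {
    int S = peakPressure(Candidates[I]);
    if (S < BestScore) {
      BestScore = S;
      Best = I;
    }
  }
  return Best;
}
```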
>>
>> Scheduling with different strategies requires each strategy to preserve DAG state so that other strategies can reuse the same DAG. This can be achieved either by saving touched DAG data, or better, by not touching the DAG at all and instead annotating DAG SUnits with the information relevant to a particular strategy: SUnit has a NodeNum field which allows easy annotation without using maps. The minreg strategy implements the latter approach.
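The NodeNum annotation trick amounts to keeping per-strategy state in side arrays indexed by NodeNum, so the shared DAG is never mutated. A hedged sketch (MinRegState and its fields are hypothetical; only the NodeNum-indexing idea comes from the text above):

```cpp
#include <cstddef>
#include <vector>

// Stand-in for llvm::SUnit; the real one carries a NodeNum index.
struct SUnit {
  unsigned NodeNum;
};

// A strategy keeps its per-node state in plain side arrays indexed by
// NodeNum, instead of maps or extra fields on SUnit itself. The DAG is
// never modified, so another strategy can reuse it untouched.
struct MinRegState {
  std::vector<int> UnscheduledSuccs; // hypothetical per-strategy annotation
  std::vector<bool> Scheduled;

  explicit MinRegState(std::size_t NumSUnits)
      : UnscheduledSuccs(NumSUnits, 0), Scheduled(NumSUnits, false) {}

  void markScheduled(const SUnit &SU) { Scheduled[SU.NodeNum] = true; }
  bool isScheduled(const SUnit &SU) const { return Scheduled[SU.NodeNum]; }
};
```

Because NodeNum is a dense index, the lookup is a plain array access rather than a hash, and throwing the state away between strategies is a single destructor call.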
>>
>> GCNUpwardRPTracker
>>
>> Lightweight schedules cannot be tracked using the LLVM RP trackers; for this purpose GCNUpwardRPTracker was introduced. As the name states, it can only go upward, instruction by instruction. The order of instructions is defined by the tracker's caller, so it can be used both for tracking lightweight schedules and for IR sequences. Upward tracking is easier to implement because it only requires the region's live-out set to operate, except for one case: when we need to find the used live mask for a tuple register use. Even though, for a lightweight schedule, LiveIntervals isn't yet updated for a given instruction, it can still be used, because the live mask for a use would not change under any schedule, as all defs must dominate the use. The subregister definitions can be reordered, but the overall mask should remain the same.
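A toy model of the upward-tracking idea, with registers as plain unsigned IDs and no subregister lane masks (so the one hard case mentioned above is deliberately out of scope); the class and method names are illustrative, not the real GCNUpwardRPTracker API:

```cpp
#include <algorithm>
#include <set>
#include <utility>
#include <vector>

// Toy instruction: registers are plain unsigned IDs, no lane masks.
struct Inst {
  std::vector<unsigned> Defs;
  std::vector<unsigned> Uses;
};

class UpwardTracker {
  std::set<unsigned> Live; // currently-live registers
  std::size_t MaxPressure = 0;

public:
  // Seeded with the region's live-out set: the only state upward tracking
  // needs (modulo the tuple-register lane-mask case noted above).
  explicit UpwardTracker(std::set<unsigned> LiveOuts)
      : Live(std::move(LiveOuts)) {
    MaxPressure = Live.size();
  }

  // Walk one instruction upward: defs die above the instruction, uses
  // become live. The caller chooses the visiting order, so this works both
  // for lightweight schedules and for real instruction sequences.
  void recede(const Inst &I) {
    for (unsigned R : I.Defs)
      Live.erase(R);
    for (unsigned R : I.Uses)
      Live.insert(R);
    MaxPressure = std::max(MaxPressure, Live.size());
  }

  std::size_t getMaxPressure() const { return MaxPressure; }
};
```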
>>
>> TODO: save live-out sets for every region when recording and reuse them for subsequent RP tracking, as live-outs don't depend on the schedule.
>>
>> GCNRegPressure
>>
>> The structure to track register pressure. It contains the number of SGPRs/VGPRs used, weights for large SGPR/VGPR tuples, and a compare function - the pressure giving max occupancy wins; otherwise the pressure with the lowest large-register weight wins.
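The comparison rule can be sketched as follows. This is a toy version: the occupancy thresholds are invented for illustration and do not match any real GCN subtarget, and the field names only approximate the real GCNRegPressure:

```cpp
#include <algorithm>

// Toy version of the comparison described above: the pressure that yields
// higher occupancy wins; on a tie, the lower "large register" weight wins.
struct RegPressure {
  int SGPRs = 0, VGPRs = 0;
  int LargeRegWeight = 0; // weight of wide SGPR/VGPR tuples

  int occupancy() const {
    // Invented thresholds; real values depend on the subtarget.
    int ByVGPR = VGPRs <= 24 ? 10 : (VGPRs <= 32 ? 8 : 4);
    int BySGPR = SGPRs <= 80 ? 10 : 8;
    return std::min(ByVGPR, BySGPR);
  }

  // Returns true if *this is a better pressure than Other.
  bool isBetterThan(const RegPressure &Other) const {
    if (occupancy() != Other.occupancy())
      return occupancy() > Other.occupancy();
    return LargeRegWeight < Other.LargeRegWeight;
  }
};
```

The large-register weight acts as a tie-breaker: two schedules with the same occupancy may still differ in how hard they squeeze wide tuples, and the one leaving more slack is preferred.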
>>
>> Minimal register scheduler (example)
>>
>> This is an experimental, simple scheduler whose main purpose is to learn ways to consume as few registers as possible for a region (it doesn't care about performance at all). It doesn't always return minimal usage, but works relatively well on large regions with unrolled loops. It is also used in the tryMaximizeOccupancy scheduling pass.
>>
>> Legacy Max occupancy scheduler
>>
>> Included as an example; it mimics the current behavior. It doesn't use lightweight schedules, but shows how legacy and lightweight schedulers can be intermixed. The main difference is that it first collects all the regions to schedule and sorts them by register pressure. This way it starts with the fattest region first, knowing the best achievable occupancy beforehand. It also includes the tryMaximizeOccupancy pass, which tries to minimize register usage with the minreg strategy for the most demanding regions.
>>
>> None of these schedulers are turned on by default in this change.
>>
>> Testing:
>> Legacy Max occupancy scheduler fully passes lit tests.
>> Minreg runs lit tests without asserts.
>>
>> No performance impact so far.
>>
>>
>>
>> -----Original Message-----
>> From: Hal Finkel [mailto:hfinkel at anl.gov]
>> Sent: Tuesday, March 21, 2017 4:43 PM
>> To: Pykhtin, Valery; llvm-commits at lists.llvm.org
>> Subject: Re: [llvm] r298368 - [AMDGPU] Iterative scheduling 
>> infrastructure + minimal registry scheduler
>>
>>
>> On 03/21/2017 08:15 AM, Valery Pykhtin via llvm-commits wrote:
>>> Author: vpykhtin
>>> Date: Tue Mar 21 08:15:46 2017
>>> New Revision: 298368
>>>
>>> URL: http://llvm.org/viewvc/llvm-project?rev=298368&view=rev
>>> Log:
>>> [AMDGPU] Iterative scheduling infrastructure + minimal registry 
>>> scheduler
>> Hi Valery,
>>
>> In the future, please make your commit messages more explanatory.
>> "Iterative scheduling infrastructure + minimal registry scheduler" tells me next to nothing about what this is or how it works. It is obviously an extensive addition/change. The review had a good summary, and that should have appeared here (updated to reflect any changes made as a result of the code review).
>>
>> I'd appreciate it if you would reply to this thread with the updated summary.
>>
>> Thanks,
>> Hal
>>
>>> Differential revision: https://reviews.llvm.org/D31046
>>>
>>> Added:
>>>        llvm/trunk/lib/Target/AMDGPU/GCNIterativeScheduler.cpp
>>>        llvm/trunk/lib/Target/AMDGPU/GCNIterativeScheduler.h
>>>        llvm/trunk/lib/Target/AMDGPU/GCNMinRegStrategy.cpp
>>>        llvm/trunk/lib/Target/AMDGPU/GCNRegPressure.cpp
>>>        llvm/trunk/lib/Target/AMDGPU/GCNRegPressure.h
>>>        llvm/trunk/test/CodeGen/AMDGPU/schedule-regpressure-limit2.ll
>>> Modified:
>>>        llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h
>>>        llvm/trunk/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
>>>        llvm/trunk/lib/Target/AMDGPU/CMakeLists.txt
>>>        llvm/trunk/lib/Target/AMDGPU/GCNSchedStrategy.cpp
>>>        llvm/trunk/lib/Target/AMDGPU/GCNSchedStrategy.h
>>>        llvm/trunk/test/CodeGen/AMDGPU/schedule-regpressure-limit.ll
>>>
>>> Modified: llvm/trunk/lib/Target/AMDGPU/AMDGPUSubtarget.h
>>> ...
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages Leadership 
>> Computing Facility Argonne National Laboratory
>>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages Leadership 
> Computing Facility Argonne National Laboratory
>

--
Hal Finkel
Lead, Compiler Technology and Programming Languages Leadership Computing Facility Argonne National Laboratory


