[PATCH] D150312: [MISched] Introduce and use ResourceSegments.

Andrea Di Biagio via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Tue May 30 05:31:18 PDT 2023


andreadb added a comment.

In D150312#4375654 <https://reviews.llvm.org/D150312#4375654>, @fpetrogalli wrote:

> Thank you for the feedback @andreadb
>
> I need some time to digest it to be able to give you an answer. I have however inlined a clarification of how I have interpreted StartAtCycle and ResourceCycle. (I'd be of course happy to revisit my interpretation if it makes things more clear)
>
> In D150312#4371743 <https://reviews.llvm.org/D150312#4371743>, @andreadb wrote:
>
>> Hi Francesco,
>>
>> Apologies for the very late reply. I have been quite busy these days, and I am still trying to figure out mentally how well this new framework works in practice.
>>
>> I am thinking about edge cases where writes of the same instruction somehow conflict in terms of resource cycles and/or `StartAtCycle` quantities.
>>
>> If I understand it correctly, `StartAtCycle` forces the scheduling algorithm to find valid segments after that relative cycle. Scheduling won't fail if no segment can be allocated exactly at that cycle.
>> Basically, resource consumption is requested to start not before `StartAtCycle`. However, it is OK to start after `StartAtCycle` if slot allocation is unsuccessful.
>> Is that correct?
>>
>> If so, then what happens if I declare the following write:
>>
>> Write W = [ A (cycles=1, start_at=0), B (cycles=1, start_at=0), AB (cycles=3, start_at=1) ].
>
> I do not have an answer (yet!) on the question following this set up, however I wanted to clarify that the way I have intended StartAtCycle and ResourceCycle in the tablegen description is as follows.
>
> For a resource RES used in a WriteRes that is used for 3 cycles starting at cycle 2, the tablegen description I expect to use is the following:
>
>   def : WriteRes<..., [RES]> {
>     let ResourceCycles = [5];
>     let StartAtCycle = [2];
>   }
>
> This means that the total number of cycles for resource RES is given by the difference between the corresponding values in ResourceCycles and StartAtCycle, which results in `5 - 2 = 3`.
>
> The reason for this choice is the following. In the current code, resource usage is always considered booked from cycle 0 (resulting in overbooking). For example, given ResourceCycles = [1,3] for resources A, B, I assumed the meaning to be A for 1 cycle, followed by B for 2 cycles (cycle 0, which overlaps the use of A, is overbooked for B).
>
>   cycle 0 | 1 | 2 | 3 | 4 | 5
>   A     X
>   B     X   X   X
>
> I decided to reinterpret ResourceCycles as "ReleaseAtCycle" because I did not want to mess with the values of the current scheduling models, in case people wanted to optimise an existing one in a similar situation by just adding StartAtCycle = [0,1], without the need to change any of the values in ResourceCycles. By setting StartAtCycle = [0,1] for the example, we would get the real resource usage without having to change ResourceCycles from [1,3] to [1,2].

Thanks, that makes sense.

For most devs, ResourceCycles is just a measure of latency; it declares for how many cycles a resource becomes unavailable after "instruction issue".
To avoid confusion, I suggest adding a code comment that further emphasizes how `ResourceCycles` actually means `ReleaseAtCycle`. Otherwise, some people may wrongly assume that ResourceCycles is relative to `StartAtCycle`.
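To make the ReleaseAtCycle reading concrete, here is a minimal sketch (Python, with a hypothetical `occupied_cycles` helper; not the actual MISched code) of how a (StartAtCycle, ReleaseAtCycle) pair maps to busy cycles:

```python
def occupied_cycles(start_at_cycle, release_at_cycle):
    # Under the ReleaseAtCycle reading, a resource is busy in the half-open
    # interval [StartAtCycle, ReleaseAtCycle); the busy-cycle count is the
    # difference between the two values.
    return list(range(start_at_cycle, release_at_cycle))

# The WriteRes above (ResourceCycles = [5], StartAtCycle = [2]):
# RES is busy for 5 - 2 = 3 cycles.
print(occupied_cycles(2, 5))  # [2, 3, 4]

# The ResourceCycles = [1,3] example with StartAtCycle = [0,1]:
print(occupied_cycles(0, 1))  # A: [0]
print(occupied_cycles(1, 3))  # B: [1, 2]
```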

>   cycle 0 | 1 | 2 | 3 | 4 | 5
>   A     X
>   B         X   X
>
> Essentially, to me resource usage is from StartAtCycle to ResourceCycles, while I *think* that you intend the resource being booked from StartAtCycle to StartAtCycle + ResourceCycles.
>
> If we proceed with my interpretation, I think it will be easier to mass rename ResourceCycles to ReleaseAtCycle, instead of having to manually figure out the math that is needed to optimise the existing scheduling models.
>
> Of course, this is my personal preference, and I am totally fine with changing the code to your interpretation.

I was essentially asking whether instruction issue still works the same way or not.

Is it required that ALL resource segments are reserved/allocated at issue cycle?
To put it another way: is the `StartAtCycle` a hard requirement for resource allocation? Would it prevent instruction issue if some, but not all, resources are available at their StartAtCycle?

The reason why I am asking these questions is that I want to fully understand how flexible this new model is in practice. In the future, it would be nice if we could model resource consumption on a per-micro-opcode basis.

As you know, writes may declare multiple "micro-opcodes". However, there is no way currently to describe which micro-opcodes consume which hardware resources.
For that reason, scheduling algorithms don't track the lifetime of individual micro-opcodes; instead, instructions are essentially treated like atomic entities.
So, "issue event" - as a concept - can only apply to instructions as a whole, not to individual micro-opcodes.

Example:

  A write W declaring the following:
   - 3 micro-opcodes
   - Resource consumption :
       A - 1cy
       B - 1cy
       C - 2cy   -- StartAtCycle=1

Given the following scenario:

  cycle 0 | 1 | 2 | 3 |
  A     -
  B     -  
  C     X   X

In this scenario, instruction issue must be delayed because not all resources are available at their relative `StartAtCycle`.
C is busy for two more cycles, so issue of write W must be delayed for 1 extra cycle (even though A and B are already available).
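The delay in this scenario can be pictured with a toy computation (Python sketch; `busy_until` and the helper name are illustrative, not the scheduler's real data structures):

```python
def earliest_issue_cycle(busy_until, uses):
    # uses: (resource, StartAtCycle) pairs for one write.
    # busy_until: first cycle at which each resource becomes free again.
    # The write can issue at cycle C only if every resource is free by
    # C + StartAtCycle, so C >= busy_until[res] - StartAtCycle for all uses.
    return max(max(0, busy_until[res] - start_at) for res, start_at in uses)

# A and B are free now; C is busy for 2 more cycles and has StartAtCycle = 1,
# so write W is delayed by 1 cycle even though A and B are idle.
busy = {"A": 0, "B": 0, "C": 2}
print(earliest_issue_cycle(busy, [("A", 0), ("B", 0), ("C", 1)]))  # 1
```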

However, if we knew that each resource was independently consumed by a different micro-opcode of W, we could have extracted a bit more parallelism.
To put it another way, write W could have "partially" issued at relative cycle 0. However, the micro-opcode consuming resource C would have been delayed by an extra cycle.

This obviously would have complicated the model, and it would have required per-micro-opcode knowledge which we don't have at the moment.
So, this is not feasible now, but it could be a future development (although unlikely).

I was wondering whether your `StartAtCycle` could have simplified that future development or not. I think your design is completely orthogonal, and it shouldn't prevent any further development in the area of modelling micro-opcode scheduling.

So, overall I think that your StartAtCycle is a nice and useful addition.

> Having said that, I'll take a look into the issue you are reporting. It would really help if you could point me at some scheduling models in the sources that use the resource groups mechanism you describe, because I could use it as a starting point to play with it and see what happens.
>
> Thanks!

Have a look at the Haswell model on x86.

  defm : HWWriteResPair<WriteFHAdd,   [HWPort1, HWPort5], 5, [1,2], 3, 6>;

It describes the latency/throughput profile of a horizontal add.
A horizontal add is composed of 3 micro-opcodes:

- 2 shuffles (can only execute on HWPort5).
- 1 ADD (can only execute on HWPort1).

Shuffle opcodes are independent of each other and can start execution immediately. The ADD opcode will have to wait for the shuffles to complete. The ADD could be marked as `StartAtCycle=1`.
HWPort5 is a bottleneck for shuffle opcodes; the 2 shuffle micro-opcodes must be serialised. That explains why it has a resource consumption of 2cy.
There is an issue in that definition: HWPort1 is only consumed for 1 resource cycle. However, it should be 3 (if we want `ResourceCycles` to mean `ReleaseAtCycle`).
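A quick way to see the problem with that definition (a Python sketch reusing the half-open interval reading; hypothetical helper, not LLVM code): with the ADD marked `StartAtCycle=1` but HWPort1 still released at cycle 1, the booking interval would be empty, which is why the release value needs to become 3:

```python
def occupied_cycles(start_at_cycle, release_at_cycle):
    # Half-open booking interval [StartAtCycle, ReleaseAtCycle).
    return list(range(start_at_cycle, release_at_cycle))

# As currently written: HWPort1 released at cycle 1, ADD starting at cycle 1.
print(occupied_cycles(1, 1))  # [] -- HWPort1 would never be booked

# With the release value bumped to 3, HWPort1 is booked for cycles 1 and 2.
print(occupied_cycles(1, 3))  # [1, 2]
```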

I am sure that there are several other (bad and good) examples in that file which also involve group resources.
On AMD platforms, shuffle opcodes can be issued to multiple pipes, so that bottleneck would not exist. For those models, the tablegen definition would be similar except that it would use a resource group instead of a unit for the two shuffles.

-Andrea


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D150312/new/

https://reviews.llvm.org/D150312


