[llvm] [AMDGPU] Reset WavesPerEU in AAAMDWavesPerEU::initialize (PR #114162)

Wed Oct 30 10:21:00 PDT 2024

shiltian wrote:

TL;DR This PR hides other issues by treating the symtom.

For the detailed analysis, I'll start with how the AA is expected to work. Typically an AA starts with a "best" state and works towards a "worse" state until reaching a fixed point. If the fixed point is pessmisitic, it is at the "worst" state. For `AAAMDWavesPerEU`, it starts with an empty assumed range, and uses "clamp" (effectively a union) in `updateImpl`. The range just gets wider and wider in iterations of update. This is the reason that `AAAMDWavesPerEU` can work on partial call graph, as mentioned by Matt offline. This also applies to `AAAMDFlatWorkGroupSize`. Or does it? Let's put a mark (1) here and we will come back to this later.

Now let's take a look at in what cases the proposed changes in this PR can work. In current `initialize`, we use `intersectKnown`. This will only update the known range, and the assumed range will not get updated. Later in `updateImpl`, we will update the assumed range. Replacing `intersectKnown` with `&=` will update assumed range as well. If the initial assumed range is narrower than those of all its callers, it doesn't matter because the assumed range gets clamped (wider) anyway. The only case that can make a difference here is, the initial assumed range is wider, while all its callers' are not as wide as it. In this case, if we don't propagate the known range to the assumed range, the result range will end up with a narrower range.

The proposed change in the PR is basically to liberally widen the assumed range and push it to a "worse" state or even "worst" state. This is also the reason that, the attribute is lost in almost all the test case update, because the result range is just as wide as default, such that the `manifest` doesn't bother to add it.

If I understand the baseline discussion correctly, after two runs in the good case, the result range is `(2,8)`, while in the bad case, the range is `(4,8)`. We somehow missed a wider range in the bad case. When could this happen? There might be two possibilities:

1. In the good case, one caller can provide a wider range, while there is no one in the bad case.
2. In both cases, no caller actually provides a wider range. In the good case, we set a wider known range in `initialize` (remember that `intersectKnown` only sets known range), and later we somehow indicate a pessimistic fixed point such that the known range gets propogated to the assumed range, thus eventually in `manifest` we set the attribute with the wider range. However, in the bad case, we don't indicate pessimistic fixed point.

The calculation of `AAAMDWavesPerEU` is a jointed effort of `AAAMDWavesPerEU` and `AAAMDFlatWorkGroupSize`. As mentioned ealier, `AAAMDWavesPerEU` can only get wider, and `AAAMDFlatWorkGroupSize` is same. Now the question is, does a wider `AAAMDFlatWorkGroupSize` lead to a wider `AAAMDWavesPerEU` as well? As I pointed out in the mark (1) earlier, can `AAAMDWavesPerEU` really work on a partial call graph? If a wider `AAAMDFlatWorkGroupSize` lead to a wider `AAAMDWavesPerEU`, it can indeed work on a partial call graph, because both two is monotonic. However, if that is not the case, then it doesn't work, because a half-baked narrow `AAAMDFlatWorkGroupSize` could lead to a wider `AAAMDWavesPerEU`. Based on `getEffectiveWavesPerEU`, a wider `AAAMDFlatWorkGroupSize` could lead to a narrower `AAAMDWavesPerEU` because when the upper of `AAAMDFlatWorkGroupSize` is larger (wider), the lower of `AAAMDWavesPerEU` is larger as well, but a larger lower of `AAAMDWavesPerEU` means a narrower range! One might be wondering, in what cases we use a half-baked value. Well, this will not happen if we only run `AMDGPUAttributor` once.

For `IntegerRangeState`, a pessimistic state is not equivalent to an invalid state, thus the `manifest` will still run even if it is in a pessimistic state. This allows the half-baked attribute to be added to a function. Now with the `AMDGPUAttributor` runs twice, a half-baked `AAAMDFlatWorkGroupSize` can be added into a function in the 1st run. In the 2nd run, this value will be picked up by `AAAMDWavesPerEU` at `initialize`. Based on our disccusion above, this could lead to a wider known range. However, the 2nd run is at post-link stage, thus the call graph is complete, so it doesn't indicate a pessimistic fixed point. Now if the callers' range are not as wide as the known range, we will end up with a narrower range.

That being said, I think the root cause of this issue is the use of half-baked `AAAMDFlatWorkGroupSize`.

https://github.com/llvm/llvm-project/pull/114162