[PATCH] D114832: [SROA] Improve SROA to prevent generating redundant coalescing operations.

Wed Dec 8 10:26:33 PST 2021

luna added a comment.

In D114832#3179919 <https://reviews.llvm.org/D114832#3179919>, @luna wrote:

> In D114832#3179890 <https://reviews.llvm.org/D114832#3179890>, @luna wrote:
>
>> In D114832#3179810 <https://reviews.llvm.org/D114832#3179810>, @davidxl wrote:
>>
>>> In D114832#3178967 <https://reviews.llvm.org/D114832#3178967>, @lebedev.ri wrote:
>>>
>>>> In D114832#3177534 <https://reviews.llvm.org/D114832#3177534>, @davidxl wrote:
>>>>
>>>>> The bad code pattern generated is probably not general enough to be useful to introduce a cleanup pass or to enhance existing pass to do so -- it will probably just shift the complexity from one pass to another. Fixing this at the source (SROA) is reasonable (unlike other canonicalization pass, this code pattern from SROA actually makes IR much worse)
>>>>
>>>> Who defines what is/isn't reasonable? :)
>>>>
>>>> As i see it, the input IR may or may not already have such bad patterns that are similar to the bad patterns you are trying to avoid producing here.
>>>> Trying to avoid producing it has compile time cost: https://llvm-compile-time-tracker.com/compare.php?from=0850655da69a700b7def4fe8d9a44d1c8d55877c&to=8a4304c8f4253fb1944e5e5988a24285a14181c4&stat=instructions
>>>> So, you've paid for avoiding producing more bad patterns, but you are still left with the ones that were already there.
>>>>
>>>> Alternatively, instead of paying for trying not to introduce bad patterns, you could pay for trying to improve all of the patterns afterwards, regardless of their source.
>>>> Then, you still end up paying, but end up with not just the SROA patterns improved.
>>>>
>>>> This is a very standard logic behind IR changes - if something doesn't understand the pattern, generally don't try to workaround it elsewhere, just fix the missing piece.
>>>
>>> I agree with most of what you said  as a general principle, while analysis should be done case by case :)
>>>
>>> The assumption is that the producer (SROA) is pretty much the only creator of the pattern (excluding manually written IR), thus cleaning it up downstream does not really help in sharing. The compile time cost is also shifted. Imagine another hypothetical scenario: if we there are N different unique bad patterns that can be handled by one single analysis pass in the source pass, we may need to have N different downstream pass changes to handle them resulting in more waste.
>>>
>>> Having said that, suggestions on which downstream pass to handle this is useful. IIRC Mingming considered CodegenPrepare where some taildup can be done to handle tailcalls ...
>>
>> thanks for the discussions and insight everyone!
>>
>> The compile time link is helpful. Is there a pointer or rule-of-thumb to evaluate cost vs benefit?
>>
>> - Ask since the current analysis is inlined in an existing instruction walker. Adding a separate transformation (as opposed to choose a good existing one to piggyback on) would likely increase the cost (with the same effect).
>>
>> It seems the other option to consider is have a separate pass to transform bad patterns, or reuse an existing pass like CodeGenPrepare.
>>
>> It'd be helpful to know a pass that I should probably look more into, to achieve the original optimization goal here.
>
> Driven by the compile-time feedback, another thing for me to consider, (probably after a preliminary convergence on the pass to add analysis info or do the transform), is to prune the existing solution so it's faster on the benchmark.

I did a quick experiment in https://godbolt.org/z/EWWaMedsx, which basically transforms {a series of PHI, one OR} into {a series of OR, one PHI}. This approach is not implemented in the first place, because result after transformation (i.e. {a series of OR, one PHI}) might be more machine instructions, and more expensive.

Some more context before starting to implement:

1. For llvm/test/Transforms/SROA/alloca-struct.ll, original cpp and assembly code is as shown in https://godbolt.org/z/qTY6MfMGa (mentioned above)
2. IR before the first SROA pass is https://gist.github.com/minglotus-6/24bccb53d5e101cb20a7118f5e9f389f IR after SROA is like https://gist.github.com/minglotus-6/1e6b93d52687e7fbbb43942563134117
  - In particular, the alloca instruction is allocating a struct before SROA pass, and scalarized in SROA pass, and lead to the verbose instructions (lshr, trunc, etc).

Based on this, the basic essence is not to create scalarization in the first place of the usage is whole. The idea of skip scalarization also exists in GCC (e.g., The relevant PR are PR 40744 and PR 44423, mentioned in xi) of 2nd section of https://gcc.gnu.org/wiki/summit2010?action=AttachFile&do=get&target=jambor.pdf).

The IR of `test12` of `llvm/test/Transforms/SROA/basictest-opaque-ptrs.ll` is also more succinct (I didn't look into the codegen result though)

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D114832/new/

https://reviews.llvm.org/D114832