[Mlir-commits] [mlir] [MLIR][Linalg] Add aggregate ops decomposition pass and softmax decom… (PR #97582)
Stella Laurenzo
llvmlistbot at llvm.org
Wed Jul 24 15:15:28 PDT 2024
stellaraccident wrote:
> I have been trying to find a way to help land this PR without asking for too much "as a precursor" work, but given that there hasn't been much change in the approach, the real issues IMO are two:
>
> 1. It seems like there is a missing control in decomposition in general. An op might have multiple ways of decomposition that should be controllable (either through registering different interfaces or having an options struct that allows you to control the decomposition; I don't know about the exact mechanism). I don't think there is a way this PR can land in its current form without introducing such a mechanism from the get-go. We can decide the defaults, but the optionality in decomposition needs to be built (I think this should be done anyway, and I am happy to do it, but it's not a priority for me right now).
> 2. If named ops didn't have (what I consider) outdated semantics for broadcast handling, then there would be a path where you could just lower to named ops without structurally changing the lowering. That is another path to make this work. It is also worth doing, but requires broader agreement on direction.
>
> > > 2. The change to the decomposition of softmax IMO is an inefficient lowering of softmax and will require "something" to get the state back. This should be part of the PR that is changing the decomposition. It is moving from a more succinct representation that Linalg allows to something that is (artificially) hamstrung by the current definitions of the named ops. I don't expect the issue with named ops to be fixed as a precursor (though that would be the right thing to do IMO), but for this PR, I don't see how we can land it without having an option to choose how to decompose softmax (with the default being what it is today, and an option to lower to a sequence of named ops). On top of that, adding a generalization to convert everything to `linalg.generic`s is a non-starter IMO. You would be forcing all downstream users to use "recognizers" heavily to retrieve the information that is lost by generalization, while not giving them control over when they want to generalize.
> >
> >
> > Ok, I see. So this position is the opposite of what I'm proposing: changing the default decomposition to target named ops (note that this has nothing to do with generalization).
> > Here I'm summarizing the arguments for preserving the status quo and against it.
> > Cons of changing the default:
> >
> > 1. Additional steps are required to reach the same IR state.
> > 2. The IR is less succinct: explicit broadcasts are inserted due to named-op requirements.
> >
> > Pros of changing the default:
> >
> > 1. The current decomposition is a mix of an actual decomposition into exp, sum, sub, max, and div named ops + partial generalization + some fusion (I'll call the existing one a _mixed decomposition_ here to differentiate it from the proposed approach). The proposed approach limits decomposition to a single responsibility.
>
> I don't agree with this characterization. If you want to lower to named ops, which ops are you generalizing afterwards? And representing broadcasts more succinctly is not a fusion IMO. This seems like it is transplanting ideas from tosa/torch etc. into Linalg. There you need to have broadcast as a separate operation; you don't need that in Linalg. I wouldn't characterize it as a fusion (rather, tosa/torch artificially force front-ends/lowerings/programmers to introduce a broadcast since they don't have mechanisms to represent broadcasts effectively).
>
> > 2. Separating these three stages is beneficial because you can control the behavior better. For example, after decomposing softmax into a sequence of named ops, one can fuse and tile them together with another named op that was not part of softmax. With the current approach, you'd still need to run the fusion pass after the mixed decomposition to reach the same state, so pipeline complexity is the same. Moreover, new possibilities open up for pipelines that don't want to generalize the result of the decomposition.
> >
> > _Mitigating con no. 1_: Even though reaching the same result as the mixed decomposition requires additional steps, those are existing upstream transformations. Hence, the downstream changes won't be a significant burden. _Mitigating con no. 2_: As for broadcasting, as I mentioned earlier, adding implicit casting for named ops is an option. Though I still don't see an actual problem with the "same rank" requirement other than that it is "unnecessary", I'm willing to work on it if it proves valuable.
>
> Again, I don't agree that this is implicit casting. This is very much an explicit representation of broadcasting behavior. And if you think the downstream changes needed to get back to the current state are not a significant burden, please add them to your PR and then let's discuss how to package it from a user perspective. I could very well come and change the behavior back because it "suits my need", and we would never be able to reach a stable state.
>
> > I think you missed the point. The proposed decomposition only converts an aggregate operation into a sequence of non-aggregate ones. This has nothing to do with generalization. Downstreams don't need to recognize a `fill` from its generic form. The solution for you would be to do partial generalization, leaving `fill`s intact.
>
> Again, I disagree. Representing broadcasting semantics more succinctly is not about it being an "aggregate" op; rather, there should never have been a need for a `linalg.broadcast` operation in the first place IMO (and you are over-indexing on the `fill` issue; I understand completely how to do partial generalization). There was never an agreement that decomposing a softmax op into named ops as they exist today is the way to go.
Flyby review of a conversation that appears to be looping.
For #1 (controllable (de)composition) -- big +1. This is how PyTorch does it, and it is a superpower (and a big part of what makes things attractive there). Basically, you end up with a way to say what your backend prefers, and the framework gets you as close to that as it can. It is a practical way around the issue that, really, no one agrees on every detail of this stuff (and never will / there is no best).
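To make #1 concrete, a minimal transform-dialect sketch of where such a knob could live is below. The match/decompose op spellings are taken from the existing LinalgTransformOps (worth double-checking the exact forms), and the control point itself is hypothetical -- that is exactly the missing piece:

```mlir
// Sketch only: the ops below exist upstream today, but the decomposition
// "variant" selector described in the inline comment is hypothetical.
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(%root: !transform.any_op {transform.readonly}) {
    %softmax = transform.structured.match ops{["linalg.softmax"]} in %root
      : (!transform.any_op) -> !transform.any_op
    // Today this always produces one fixed lowering. A hypothetical option on
    // this op (or an options struct on the pass/pattern) would let a backend
    // state whether it wants the current mixed lowering or pure named ops.
    %decomposed = transform.structured.decompose_interface %softmax
      : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```

The exact mechanism (interface registration, an options struct, or an attribute on the transform op) matters less than having the knob at all.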
For #2 -- I'm watching this same basic discussion loop on multiple threads and PRs (basically the role of named ops and broadcasting). We're using sloppy language (calling it cast, explicit, implicit, fusion, etc.), so I'm not going to add to that. But it is quite clear to me that there are a couple of opposing viewpoints here being ground out patch by patch (with the usual amount of friction that entails). My tendency is to side with Mahesh's viewpoint -- not because of upstream/downstream/whatever -- but because that viewpoint is more compatible with all of the transformations we should be able to do on this opset (and I've lived too many lives with the really substandard backdrop of trying to use the fixed-function op libraries of the frameworks for transformations and optimizations). But if I squint, I can see the value in the "named op everything" viewpoint *iff* it is part of a holistic design supporting configurable specialization and robust generalization.
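To keep #2 concrete as well, here is a hand-written sketch (not output of this PR) of what a "named ops only" decomposition of `linalg.softmax` could look like for a static shape, reducing over the last dimension. The explicit `linalg.broadcast`s are exactly the extra structure the two viewpoints disagree about; shapes, SSA names, and op spellings are illustrative:

```mlir
func.func @softmax_named(%src: tensor<2x16x32xf32>) -> tensor<2x16x32xf32> {
  %neg_inf = arith.constant 0xFF800000 : f32
  %zero = arith.constant 0.0 : f32
  %red_init = tensor.empty() : tensor<2x16xf32>
  %full_init = tensor.empty() : tensor<2x16x32xf32>

  // max over the reduction dimension, then broadcast back to the full shape.
  %max_acc = linalg.fill ins(%neg_inf : f32) outs(%red_init : tensor<2x16xf32>) -> tensor<2x16xf32>
  %max = linalg.reduce { arith.maximumf }
           ins(%src : tensor<2x16x32xf32>) outs(%max_acc : tensor<2x16xf32>) dimensions = [2]
  %max_b = linalg.broadcast ins(%max : tensor<2x16xf32>)
             outs(%full_init : tensor<2x16x32xf32>) dimensions = [2]

  // exp(x - max)
  %sub = linalg.sub ins(%src, %max_b : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
           outs(%full_init : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
  %exp = linalg.exp ins(%sub : tensor<2x16x32xf32>)
           outs(%full_init : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>

  // sum over the reduction dimension, broadcast, and divide.
  %sum_acc = linalg.fill ins(%zero : f32) outs(%red_init : tensor<2x16xf32>) -> tensor<2x16xf32>
  %sum = linalg.reduce { arith.addf }
           ins(%exp : tensor<2x16x32xf32>) outs(%sum_acc : tensor<2x16xf32>) dimensions = [2]
  %sum_b = linalg.broadcast ins(%sum : tensor<2x16xf32>)
             outs(%full_init : tensor<2x16x32xf32>) dimensions = [2]
  %res = linalg.div ins(%exp, %sum_b : tensor<2x16x32xf32>, tensor<2x16x32xf32>)
           outs(%full_init : tensor<2x16x32xf32>) -> tensor<2x16x32xf32>
  return %res : tensor<2x16x32xf32>
}
```

Whether those broadcasts are useful explicit structure (easy to tile and fuse as named ops) or redundant materialization of something Linalg's indexing maps already express is precisely the disagreement above -- which is why I think it deserves its own thread rather than being settled here.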
I don't want to litigate any of this on a PR like this, but I do think there are a couple of broader discussions here that we'd be better off having.
https://github.com/llvm/llvm-project/pull/97582