<div dir="ltr">Here's a proposal that would introduce the IR transform pass that I was thinking of:<br><a href="https://reviews.llvm.org/D73480">https://reviews.llvm.org/D73480</a><br><br><div>The review also has a list of existing motivating bugs that could be handled there. Looking back at the comments about the integer multiply example, unfortunately an IR pass might be too early to deal with that particular problem.<br></div><div><br></div><div>I do want to clear up a potential misunderstanding though - unless there's a bug, InstCombine isn't creating vector math ops even if there are surrounding extract/insert ops. AFAIK, we've always assumed that's off-limits without a cost model. So that's more likely happening in SLP.</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sun, Jan 12, 2020 at 9:03 AM Sanjay Patel <<a href="mailto:spatel@rotateright.com">spatel@rotateright.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>I've seen several examples that I'd also call "reducing scalarization" rather than vectorization. Here's the most recent case:<br></div><div><a href="https://bugs.llvm.org/show_bug.cgi?id=44478#c1" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=44478#c1</a></div><div><br></div><div>I like the idea of introducing a place where we can deal with these patterns generally. But I have a slightly different idea - I'd like to create an IR-level "VectorCombiner". This would be a late IR pass that runs before and/or after the vectorizers. It would behave something like InstCombine - iterative peephole pattern-matching and transforms. But it would have access to the TTI cost models, so targets can opt out of anything in that pass (or opt out of the pass entirely if they want).</div><div><br></div><div>The reasons for doing it this way are:</div><div>1. The transforms we're talking about don't fit into SLP or Loop Vectorization, but they would make those passes and codegen more successful by turning semi-vector patterns into more standard forms.<br></div><div>2. Doing the transforms in IR can enable further optimizations in generic passes like InstCombine.<br></div><div>3. Practical matter - extending the vector passes is difficult. SLP in particular is too complicated in my experience. Gluing on some seemingly simple extensions has led to miscompiles/regressions that nobody has solved, and I don't see that changing.</div><div></div><div>4. Another practical matter, it's easier to work on IR. The pattern-matching is templated and not limited to a basic block like SDAG. Presumably, that argument goes away with GlobalISel, but go back to #2. (And presumably, that's why we do vectorization on IR in the first place.)<br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 11, 2020 at 6:29 PM Nemanja Ivanovic via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Absolutely. We do it for scalars, so it would likely be a matter of just extending it.</div><div><br></div><div>But that is one example. The issue of extracting elements, performing an operation on each element individually and then rebuilding the vector is likely more prevalent than that. At least I think that is the case, but I'll do some analysis to see if it is so or not.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 11, 2020 at 6:15 PM Craig Topper <<a href="mailto:craig.topper@gmail.com" target="_blank">craig.topper@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div dir="auto">For the MULHS example couldn’t that be fixed by adding a DAG combine to create vector MULHS before type legalization?</div></div><div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 11, 2020 at 3:08 PM Nemanja Ivanovic via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div>Thanks so much for your feedback Simon.</div><div><br></div><div>I am not sure that what I am proposing here is at odds with what you're referring to (here and in the PR you linked). The key difference AFAICT is that the pattern I am referring to is probably more aptly described as "reducing scalarization" than as "vectorization". The reason I say that is that the inputs are vectors and the output is also a vector - we just perform the operation on extracted elements rather than on the input vectors themselves.</div><div><br></div><div>In the PR you linked, there is an example that shows the difference (simplified to <span style="font-family:monospace"><2 x double></span> for brevity):</div><div><span style="font-family:monospace">define dso_local <2 x double> @test(i64 %a, i64 %b) {<br>entry:<br> %conv = uitofp i64 %a to double<br> %conv1 = uitofp i64 %b to double<br> %vecinit = insertelement <2 x double> undef, double %conv, i32 0<br> %vecinit2 = insertelement <2 x double> %vecinit, double %conv1, i32 1<br> ret <2 x double> %vecinit2<br>}</span></div><div><br></div><div>The inputs here are scalars so I suppose it is quite possible (perhaps likely) that on some targets, doing the insert with integers and then converting the vector is cheaper (although this is definitely not the case with PPC).</div><div>But what I am proposing here is actually handling something like this:</div><div><span style="font-family:monospace">define dso_local <2 x double> @test(<2 x i64> %a) {<br>entry:<br> %vecext = extractelement <2 x i64> %a, i32 0<br> %vecext1 = extractelement <2 x i64> %a, i32 1<br> %conv = sitofp i64 %vecext to double<br> %conv2 = sitofp i64 %vecext1 to double<br> %vecinit = insertelement <2 x double> undef, double %conv, i32 0<br> %vecinit3 = insertelement <2 x double> %vecinit, double %conv2, i32 1<br> ret <2 x double> %vecinit3<br>}</span></div><div>With this type conversion, InstCombine will actually simplify this as expected. And I think that is the right thing to do - I can't see the scalarized version being cheaper on any target. Since we already do something quite similar in InstCombine, I would assume it would be rather uncontroversial to do it on the SDAG.</div><div><br></div><div>Now, a reasonable question might be "why do it on the SDAG if we already do it in InstCombine?" And the short answer is it is quite possible that legalization will introduce scalarization code and a subsequent DAG combine creates an opportunity to remove the scalarization. Here is an example of that:</div><div><span style="font-family:monospace">define dso_local <2 x i64> @testv(<2 x i64> %a, <2 x i64> %b) {<br>entry:<br> %sexta = sext <2 x i64> %a to <2 x i128><br> %sextb = sext <2 x i64> %b to <2 x i128><br> %mul = mul nsw <2 x i128> %sexta, %sextb<br> %shift = lshr <2 x i128> %mul, <i128 64, i128 64><br> %trunc = trunc <2 x i128> %shift to <2 x i64><br> ret <2 x i64> %trunc<br>}</span></div><div>On PPC, the legalizer will scalarize this since we do not have v2i128. Then the DAG combiner will produce the pattern I am referring to in this RFC:</div><div><pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:arial,sans-serif"><span style="font-family:monospace">(v2i64 build_vector (mulhs (extractelt %a, 0), (extractelt %b, 0)),<br> (mulhs (extractelt %a, 1), (extractelt %b, 1)))</span></span></pre></div><div>And if the target has <span style="font-family:monospace">mulhs</span> legal for the vector type, this is strictly worse. So no matter what we do in InstCombine or the SLP vectorizer, we will end up with non-optimal code.</div><div><br></div><div>If we also handle shuffles of input vectors, we can catch things such as the following:<br></div><div><span style="font-family:monospace">define dso_local <4 x float> @test(<4 x i32> %a, <4 x i32> %b) {<br>entry:<br> %vecext = extractelement <4 x i32> %a, i32 0<br> %vecext1 = extractelement <4 x i32> %a, i32 1<br> %vecext4 = extractelement <4 x i32> %b, i32 2<br> %vecext7 = extractelement <4 x i32> %b, i32 3<br> %conv = sitofp i32 %vecext to float<br> %conv2 = sitofp i32 %vecext1 to float<br> %conv5 = sitofp i32 %vecext4 to float<br> %conv8 = sitofp i32 %vecext7 to float<br> %vecinit = insertelement <4 x float> undef, float %conv, i32 0<br> %vecinit3 = insertelement <4 x float> %vecinit, float %conv2, i32 1<br> %vecinit6 = insertelement <4 x float> %vecinit3, float %conv5, i32 2<br> %vecinit9 = insertelement <4 x float> %vecinit6, float %conv8, i32 3<br> ret <4 x float> %vecinit9</span></div><div><span style="font-family:monospace">}</span></div><div><br></div><div>Is<span style="font-family:arial,sans-serif"> equivalent to:<br></span></div><div><span style="font-family:monospace">define dso_local <4 x float> @testv(<4 x i32> %a, <4 x i32> %b) {<br>entry:<br> %shuffle = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7><br> %0 = sitofp <4 x i32> %shuffle to <4 x float><br> ret <4 x float> %0<br>}</span></div><div><span style="font-family:arial,sans-serif"><br></span></div><div><span style="font-family:arial,sans-serif">Of course, this is something we can handle in InstCombine, but I am wondering if we may again be missing situations where it is the DAG legalizer that creates the scalarization code.</span><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jan 11, 2020 at 8:25 AM Simon Pilgrim via llvm-dev <<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Performing vectorization in the SelectionDAG tends to be fraught
with problems as we don't have a cost model mechanism to control
it; we hit this in <a href="https://bugs.llvm.org/show_bug.cgi?id=35732" target="_blank">https://bugs.llvm.org/show_bug.cgi?id=35732</a>
where conversion ops were being vectorized in DAG whether we
wanted to or not, purely based on a isOperationLegalOrCustom call
(just because an op is possible doesn't mean its efficient to do
so under all circumstances).<br>
</p>
<p>In general I think we should be relying on the SLPVectorizer to
handle this instead. If SLP is failing, we're probably better off
spending dev time fixing it there instead of trying to cleanup in
DAG (e.g. too often I find its a case of tweaking the TTI cost
models).</p>
<p>Having said that, we do hit cases where vectorizable patterns do
appear during legalization - x86 does some very limited bitwise op
vectorization in lowerBuildVectorToBitOp as this was such a common
occurrence. So there's probably scope to provide some common
helper functions in SelectionDAG.cpp to recognise vectorizable
patterns (similar to SelectionDAG::matchBinOpReduction) but we
leave it up to the target's build_vector combiner/lowering to
actually decide whether to make use of them.</p>
<p>Simon.<br>
</p>
<div>On 10/01/2020 21:49, Nemanja Ivanovic
via llvm-dev wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>I have added a few PPC-specific DAG combines in the past
that follow this pattern on specific operations. Now that it
appears that this would be useful to do on yet another
operation, I'm wondering what people think about doing this in
the target-independent DAG Combiner for any legal/custom
operation on the target.</div>
<div><br>
</div>
<div>TL; DR;</div>
<div>The generic pattern would look like this:</div>
<div>
<pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:monospace">(build_vector (op (extractelt %a, 0), [(extractelt %b, 0)]...),
(op (extractelt %a, 1), [(extractelt %b, 1)]...), ...)</span>
</pre>
<pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:arial,sans-serif">Basically, if the build vector is built from the same operation applied on elements extracted from another vector (or pair of vectors for binary ops), then we can check for the legality of the operation on the vector type. If the operation is legal for the vector type (and the operand and result vector types are the same), then we can just convert this to
<span style="font-family:monospace">
(op %a [, %b])</span></span></pre>
<pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:arial,sans-serif">Which is likely going to produce better code in all cases. Of course, things like this have a tendency to not be better in all cases, but I think we can probably account for any corner cases that arise.
Example:
<span style="font-family:monospace">(v2i64 build_vector (mulhs (extractelt %a, 0), (extractelt %b, 0)),
(mulhs (extractelt %a, 1), (extractelt %b, 1)))</span>
Can be converted to the following on a target that has the operation available on vectors.
<span style="font-family:monospace">(v2i64 mulhs %a, %b)</span></span></pre>
<pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:arial,sans-serif">A further improvement might be if the order of extracted elements doesn't match the order in the build_vector, that we shuffle one or both input vectors prior to applying the operation.
</span></pre>
<pre style="box-sizing:inherit;margin:4px 0px;padding:8px;font-size:12px;line-height:1.50001;font-variant-ligatures:none;white-space:pre-wrap;word-break:normal;font-family:Monaco,Menlo,Consolas,"Courier New",monospace;border-radius:4px;color:rgb(29,28,29);font-style:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:left;text-indent:0px;text-transform:none;word-spacing:0px;text-decoration-style:initial;text-decoration-color:initial"><span style="font-family:arial,sans-serif">If you think this is a good idea (or a terrible idea), please let me know.
</span></pre>
</div>
</div>
<br>
<fieldset></fieldset>
<pre>_______________________________________________
LLVM Developers mailing list
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a>
</pre>
</blockquote>
</div>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div></div>-- <br><div dir="ltr">~Craig</div>
</blockquote></div>
_______________________________________________<br>
LLVM Developers mailing list<br>
<a href="mailto:llvm-dev@lists.llvm.org" target="_blank">llvm-dev@lists.llvm.org</a><br>
<a href="https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev" rel="noreferrer" target="_blank">https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev</a><br>
</blockquote></div>
</blockquote></div>