[PATCH] D111846: [LV] Drop NUW/NSW flags from scalarized instructions that need predication

Diego Caballero via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Oct 18 11:59:59 PDT 2021


dcaballe added a comment.

Thank you all for the quick response!

> A bit surprising that we didn't notice this earlier but I guess masking does not get used that often.

Interestingly, we only observed this issue with AVX512. It's very likely that the cost model heavily penalizes masking on some ISAs, so this scenario wouldn't arise that often.

> I understand that alternatives like extending LoopVectorizationCostModel::isScalarWithPredication or VPRecipeBuilder::handleReplication would be worse. They'd entail the loop to be costed higher or have an extra branch inside the vector loop that we do not seem to need, right?

Yeah, I don't think we should penalize or change the vectorization strategy for these cases, since the nuw/nsw flags don't actually impact the performance of the generated vector code.



================
Comment at: llvm/lib/Transforms/Vectorize/LoopVectorize.cpp:3072
+  if (Replicate && !Replicate->isPredicated() && !State.Instance &&
+      State.VF.isVector() && isa<OverflowingBinaryOperator>(Cloned) &&
+      Legal->blockNeedsPredication(Instr->getParent())) {
----------------
lebedev.ri wrote:
> lebedev.ri wrote:
> > What about `exact` on division?
> (... and other FP fast-math flags)
Good catch, thanks! I can't think of an example that applies to FP fast-math flags, though. The problem happens when an instruction in a predicated block is scalarized without a predicate, and the only cases I can think of that fall into that category are address computations. Do you have any other case in mind?

I'll try to write a test for the division (`exact`) case, and I can follow up separately on any other cases we find. This is really blocking internal work. Does that sound reasonable?
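For reference, a minimal IR sketch of the kind of situation I mean (hypothetical names, simplified from the PR52111 shape, not the actual reproducer):

```llvm
loop:
  %iv = phi i64 [ 0, %entry ], [ %iv.next, %latch ]
  %cond = icmp ult i64 %iv, %n
  br i1 %cond, label %if.then, label %latch

if.then:
  ; In the scalar loop, nuw/nsw only need to hold on this guarded path.
  ; If the vectorizer linearizes the block and scalarizes the add without
  ; a predicate, the add now also executes for lanes where %cond is
  ; false, where the addition may wrap and 'add nuw nsw' yields poison.
  %add = add nuw nsw i64 %iv, %offset
  br label %latch
```

Dropping the flags on the unconditionally-executed scalarized copy is safe because it only makes the result more defined.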


================
Comment at: llvm/test/Transforms/LoopVectorize/pr52111.ll:3
+
+; Test case for PR52111. Make sure that NUW/NSW flags are dropped from
+; instructions in blocks that need predication and are linearized and masked
----------------
fhahn wrote:
> Can you pre-commit the test?
Do you want me to commit it before the fix and mark it as an expected failure?


================
Comment at: llvm/test/Transforms/LoopVectorize/pr52111.ll:53
+
+attributes #0 = { noinline nounwind uwtable "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "frame-pointer"="none" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="skx" "target-features"="+adx,+aes,+avx,+avx2,+avx512bw,+avx512cd,+avx512dq,+avx512f,+avx512vl,+bmi,+bmi2,+clflushopt,+clwb,+cx16,+cx8,+f16c,+fma,+fsgsbase,+fxsr,+invpcid,+lzcnt,+mmx,+movbe,+pclmul,+pku,+popcnt,+prfchw,+rdrnd,+rdseed,+sahf,+sse,+sse2,+sse3,+sse4.1,+sse4.2,+ssse3,+x87,+xsave,+xsavec,+xsaveopt,+xsaves" "unsafe-fp-math"="false" "use-soft-float"="false" }
+
----------------
fhahn wrote:
> Is this needed?
The problem is that this loop was only vectorized when targeting AVX512, and it might be vectorized differently if I remove some of these attributes. Let me try to simplify all of this and see what happens.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D111846/new/

https://reviews.llvm.org/D111846
