[llvm] [SLP]Improve/fix subvectors in gather/buildvector nodes handling (PR #104144)
Danial Klimkin via llvm-commits
llvm-commits at lists.llvm.org
Thu Aug 29 08:26:51 PDT 2024
dklimkin wrote:
Stack trace:
```
PC: @ 0x7f902096334c (unknown) llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
@ 0x7f905174fe80 (unknown) (unknown)
@ 0x7f902096334c 288 llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
@ 0x7f9020934fab 288 llvm::slpvectorizer::BoUpSLP::vectorizeOperand()::$_5::operator()()
@ 0x7f9020931478 240 llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
@ 0x7f9020931a8c 2400 llvm::slpvectorizer::BoUpSLP::vectorizeTree()
@ 0x7f90209311ca 240 llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
@ 0x7f9020931bd9 2400 llvm::slpvectorizer::BoUpSLP::vectorizeTree()
@ 0x7f9020939fb7 736 llvm::slpvectorizer::BoUpSLP::vectorizeTree()
@ 0x7f9020939e17 160 llvm::slpvectorizer::BoUpSLP::vectorizeTree()
@ 0x7f902094ce10 1248 llvm::SLPVectorizerPass::tryToVectorizeList()
@ 0x7f902095201d 304 tryToVectorizeSequence<>()
@ 0x7f902097a9ee 240 llvm::SLPVectorizerPass::vectorizeCmpInsts<>()
@ 0x7f9020948469 640 llvm::SLPVectorizerPass::vectorizeChainsInBlock()
@ 0x7f9020945a11 4848 llvm::SLPVectorizerPass::runImpl()
@ 0x7f902094551c 160 llvm::SLPVectorizerPass::run()
@ 0x7f902565b252 32 llvm::detail::PassModel<>::run()
@ 0x7f8febfa8f19 272 llvm::PassManager<>::run()
@ 0x7f904d8e1152 32 llvm::detail::PassModel<>::run()
@ 0x7f8febfad2a9 320 llvm::ModuleToFunctionPassAdaptor::run()
@ 0x7f904d8e0f12 32 llvm::detail::PassModel<>::run()
@ 0x7f8febfa7e69 272 llvm::PassManager<>::run()
@ 0x7f903bbed862 2880 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x7f9077c1475e 208 xla::gpu::TranslateLLVMToLLVMIR()
@ 0x7f9077c16f2d 656 xla::gpu::CompileTritonToLLVM()
@ 0x7f9077c162cf 336 xla::gpu::TritonWrapper()
@ 0x7f9078036bbb 448 xla::gpu::TritonFusion::GenerateTritonKernelAndWrapper()
@ 0x7f907803852d 688 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x7f90186f3e47 384 xla::gpu::KernelReuseCache::GetWithStatus()
@ 0x7f90186f3c89 128 xla::gpu::KernelReuseCache::GetWithStatus()
@ 0x7f90780374a4 480 xla::gpu::TritonFusion::Emit()
@ 0x7f91aa629614 464 xla::gpu::IrEmitterUnnested::EmitFusion()
@ 0x7f91aa61829b 96 xla::gpu::IrEmitterUnnested::EmitHloComputation()
@ 0x7f91b9e378f0 1824 xla::gpu::CompileModuleToLlvmIr()
@ 0x7f91ba68b22d 2320 xla::gpu::GpuCompiler::CompileToBackendResult()
@ 0x7f91ba68d599 3280 xla::gpu::GpuCompiler::RunBackend()
@ 0x7f908e3bc71c 592 xla::gpu::AutotunerCompileUtil::Compile()
@ 0x7f908e7e6296 240 xla::gpu::GemmFusionAutotunerImpl::CompileAll()::$_6::operator()()
@ 0x7f908e7fa18d 288 std::__u::__function::__policy_invoker<>::__call_impl<>()
@ 0x7f8e64a87efa 112 Eigen::ThreadPoolTempl<>::WorkerLoop()
@ 0x7f8e64a879fa 48 absl::internal_any_invocable::RemoteInvoker<>()
@ 0x7f8c7d3b3c93 256 Thread::ThreadBody()
@ 0x7f90517467db 192 start_thread
@ 0x7f903b33d05f (unknown) clone
```
Looks like an invalid shuffle mask here, with values 124 and 126 (the source vectors have only 4 elements each):
```
Invalid LLVM IR before optimizations:
Invalid shufflevector operands!
%55 = shufflevector <4 x i32> %35, <4 x i32> poison, <128 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 124, i32 1, i32 126, i32 1>
```
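For reference, a minimal standalone sketch of the rule the verifier appears to be enforcing: with two `<N x T>` inputs, every `shufflevector` mask element must be poison/undef or an index in `[0, 2N)`. The helper names `findInvalidMaskLanes` and `reportedMask` are hypothetical and are not LLVM API. The mask is rebuilt by hand on the assumption that it has 128 lanes, as its type states, with 124 and 126 sitting in lanes 124 and 126.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of LLVM): returns the lane positions whose
// mask value cannot index into the concatenation of two input vectors of
// numInputElts elements each. A value of -1 stands for a poison/undef lane.
std::vector<std::size_t> findInvalidMaskLanes(const std::vector<int> &mask,
                                              int numInputElts) {
  assert(numInputElts > 0 && "input vectors must be non-empty");
  std::vector<std::size_t> badLanes;
  for (std::size_t lane = 0; lane < mask.size(); ++lane) {
    int idx = mask[lane];
    if (idx == -1)
      continue; // poison/undef lanes are always allowed
    if (idx < 0 || idx >= 2 * numInputElts)
      badLanes.push_back(lane);
  }
  return badLanes;
}

// Hand-built copy of the reported mask (assumption: 128 lanes, lane 0 is 0,
// lanes 124 and 126 hold 124 and 126, every other lane is 1).
std::vector<int> reportedMask() {
  std::vector<int> mask(128, 1);
  mask[0] = 0;
  mask[124] = 124;
  mask[126] = 126;
  return mask;
}
```

Calling `findInvalidMaskLanes(reportedMask(), 4)` flags lanes 124 and 126, since only indices 0 through 7 are addressable for two `<4 x i32>` inputs. The values look like they could be indices into a 128-element vector, but that is a guess from the numbers alone.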
(The test in question is https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65 . If you can run it, you'll probably be able to collect more useful information; this is well outside my usual problem area.)
The run generated ~800 MB of IR, and I can't confirm whether I'm allowed to share all of it.
https://github.com/llvm/llvm-project/pull/104144