[llvm] [SLP]Improve/fix subvectors in gather/buildvector nodes handling (PR #104144)

Danial Klimkin via llvm-commits llvm-commits at lists.llvm.org
Thu Aug 29 08:26:51 PDT 2024


dklimkin wrote:

Stack trace:
```
PC: @     0x7f902096334c  (unknown)  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f905174fe80  (unknown)  (unknown)
    @     0x7f902096334c        288  llvm::slpvectorizer::BoUpSLP::ShuffleInstructionBuilder::finalize()
    @     0x7f9020934fab        288  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()::$_5::operator()()
    @     0x7f9020931478        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931a8c       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f90209311ca        240  llvm::slpvectorizer::BoUpSLP::vectorizeOperand()
    @     0x7f9020931bd9       2400  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939fb7        736  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f9020939e17        160  llvm::slpvectorizer::BoUpSLP::vectorizeTree()
    @     0x7f902094ce10       1248  llvm::SLPVectorizerPass::tryToVectorizeList()
    @     0x7f902095201d        304  tryToVectorizeSequence<>()
    @     0x7f902097a9ee        240  llvm::SLPVectorizerPass::vectorizeCmpInsts<>()
    @     0x7f9020948469        640  llvm::SLPVectorizerPass::vectorizeChainsInBlock()
    @     0x7f9020945a11       4848  llvm::SLPVectorizerPass::runImpl()
    @     0x7f902094551c        160  llvm::SLPVectorizerPass::run()
    @     0x7f902565b252         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa8f19        272  llvm::PassManager<>::run()
    @     0x7f904d8e1152         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfad2a9        320  llvm::ModuleToFunctionPassAdaptor::run()
    @     0x7f904d8e0f12         32  llvm::detail::PassModel<>::run()
    @     0x7f8febfa7e69        272  llvm::PassManager<>::run()
    @     0x7f903bbed862       2880  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f9077c1475e        208  xla::gpu::TranslateLLVMToLLVMIR()
    @     0x7f9077c16f2d        656  xla::gpu::CompileTritonToLLVM()
    @     0x7f9077c162cf        336  xla::gpu::TritonWrapper()
    @     0x7f9078036bbb        448  xla::gpu::TritonFusion::GenerateTritonKernelAndWrapper()
    @     0x7f907803852d        688  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f90186f3e47        384  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90186f3c89        128  xla::gpu::KernelReuseCache::GetWithStatus()
    @     0x7f90780374a4        480  xla::gpu::TritonFusion::Emit()
    @     0x7f91aa629614        464  xla::gpu::IrEmitterUnnested::EmitFusion()
    @     0x7f91aa61829b         96  xla::gpu::IrEmitterUnnested::EmitHloComputation()
    @     0x7f91b9e378f0       1824  xla::gpu::CompileModuleToLlvmIr()
    @     0x7f91ba68b22d       2320  xla::gpu::GpuCompiler::CompileToBackendResult()
    @     0x7f91ba68d599       3280  xla::gpu::GpuCompiler::RunBackend()
    @     0x7f908e3bc71c        592  xla::gpu::AutotunerCompileUtil::Compile()
    @     0x7f908e7e6296        240  xla::gpu::GemmFusionAutotunerImpl::CompileAll()::$_6::operator()()
    @     0x7f908e7fa18d        288  std::__u::__function::__policy_invoker<>::__call_impl<>()
    @     0x7f8e64a87efa        112  Eigen::ThreadPoolTempl<>::WorkerLoop()
    @     0x7f8e64a879fa         48  absl::internal_any_invocable::RemoteInvoker<>()
    @     0x7f8c7d3b3c93        256  Thread::ThreadBody()
    @     0x7f90517467db        192  start_thread
    @     0x7f903b33d05f  (unknown)  clone
```

Looks like an invalid mask here: it contains indices 124 and 126, but with two `<4 x i32>` operands only indices 0-7 are valid:
```
Invalid LLVM IR before optimizations:
Invalid shufflevector operands!
  %55 = shufflevector <4 x i32> %35, <4 x i32> poison, <128 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 124, i32 1, i32 126, i32 1>
```
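To make the failure condition concrete, here is a minimal standalone sketch of the range rule the verifier message points at: each non-poison mask element of a `shufflevector` must be less than twice the lane count of one source operand. The helper name `findOutOfRangeMaskLanes` and the use of `-1` for a poison lane are assumptions for illustration only; this is not the SLP vectorizer's or the IR verifier's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not an LLVM API: returns the positions in `mask`
// whose index cannot address any lane of the two concatenated sources.
// `srcLanes` is the lane count of ONE source operand (4 for <4 x i32>).
// A value of -1 stands for a poison mask element and is always accepted.
std::vector<std::size_t> findOutOfRangeMaskLanes(const std::vector<int> &mask,
                                                 int srcLanes) {
  assert(srcLanes > 0 && "source vector must have at least one lane");
  std::vector<std::size_t> bad;
  const int limit = 2 * srcLanes; // valid indices are 0 .. limit - 1
  for (std::size_t i = 0; i < mask.size(); ++i) {
    if (mask[i] == -1)
      continue; // poison lane, nothing to check
    if (mask[i] < 0 || mask[i] >= limit)
      bad.push_back(i); // index points past both source vectors
  }
  return bad;
}
```

Applied to a 128-element mask shaped like the one above with `srcLanes = 4`, such a check would flag only the two lanes holding 124 and 126, since every other element is 0 or 1.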
  
(The test in question is https://github.com/openxla/xla/blob/main/xla/service/gpu/fusions/triton/triton_fusion_emitter_parametrized_test.cc#L65. If you can run it, you'll probably be able to collect more useful information; this is well outside my usual area.)

I got ~800 MB of IR generated, and I can't confirm whether I'm allowed to share all of it.

https://github.com/llvm/llvm-project/pull/104144

