<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href="https://github.com/llvm/llvm-project/issues/88479">88479</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
please file an issue...
</td>
</tr>
<tr>
<th>Labels</th>
<td>
new issue
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
nyck33
</td>
</tr>
</table>
<pre>
```bash
(mlir-py39cuda12) nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore$ mlir-opt /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32-bare-ptr.mlir | mlir-opt -test-lower-to-nvvm="host-bare-ptr-calling-convention=1 kernel-bare-ptr-calling-convention=1 cubin-chip=sm_75 cubin-format=fatbin" | mlir-cpu-runner --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_cuda_runtime.so --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_runner_utils.so --entry-point-result=void
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
Success
PLEASE submit a bug report to https://github.com/llvm/llvm-project/issues/ and include the crash backtrace.
Stack dump:
0. Program arguments: mlir-cpu-runner --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_cuda_runtime.so --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_runner_utils.so --entry-point-result=void
#0 0x0000558a437b7d3a llvm::sys::PrintStackTrace(llvm::raw_ostream&, int) /mnt/d/LLVM/NewPolygeistDir/llvm-project/llvm/lib/Support/Unix/Signals.inc:723:22
#1 0x0000558a437b8156 PrintStackTraceSignalHandler(void*) /mnt/d/LLVM/NewPolygeistDir/llvm-project/llvm/lib/Support/Unix/Signals.inc:798:1
#2 0x0000558a437b55a3 llvm::sys::RunSignalHandlers() /mnt/d/LLVM/NewPolygeistDir/llvm-project/llvm/lib/Support/Signals.cpp:105:20
#3 0x0000558a437b75d2 SignalHandler(int) /mnt/d/LLVM/NewPolygeistDir/llvm-project/llvm/lib/Support/Unix/Signals.inc:413:1
#4 0x00007f36669ae520 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x42520)
#5 0x00007f3663936174 DynamicMemRefType<float>::DynamicMemRefType(UnrankedMemRefType<float> const&) /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/include/mlir/ExecutionEngine/CRunnerUtils.h:343:21
#6 0x00007f3663937145 void impl::printMemRef<float>(UnrankedMemRefType<float>&) /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/include/mlir/ExecutionEngine/RunnerUtils.h:238:14
#7 0x00007f3663934a42 _mlir_ciface_printMemrefF32 /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/lib/ExecutionEngine/RunnerUtils.cpp:94:1
#8 0x00007f3663934c12 printMemrefF32 /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/lib/ExecutionEngine/RunnerUtils.cpp:136:1
#9 0x00007f3666f51263
#10 0x00007f3666f51291
#11 0x0000558a443eb8ce compileAndExecute((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, void**, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/lib/ExecutionEngine/JitRunner.cpp:217:24
#12 0x0000558a443ebbcf compileAndExecuteVoidFunction((anonymous namespace)::Options&, mlir::Operation*, llvm::StringRef, (anonymous namespace)::CompileAndExecuteConfig, std::unique_ptr<llvm::TargetMachine, std::default_delete<llvm::TargetMachine>>) /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/lib/ExecutionEngine/JitRunner.cpp:235:49
#13 0x0000558a443ed307 mlir::JitRunnerMain(int, char**, mlir::DialectRegistry const&, mlir::JitRunnerConfig) /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/lib/ExecutionEngine/JitRunner.cpp:394:66
#14 0x0000558a43696dd5 main /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/tools/mlir-cpu-runner/mlir-cpu-runner.cpp:33:29
#15 0x00007f3666995d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d90)
#16 0x00007f3666995e40 __libc_start_main (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29e40)
#17 0x0000558a43696b65 _start (/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/bin/mlir-cpu-runner+0x55e1b65)
Segmentation fault
(mlir-py39cuda12) nyck33@lenovo-gtx1650:/mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore$
```
This is the llvm-project version pinned to Polygeist as a submodule.
My program is `wmma-matmul-f32-bare-ptr.mlir`, but with a `printMemrefF32` call added after I copy the result from device to host near the end.
```mlir
// Tests memref bare pointer lowering convention both host side and kernel-side;
// this works for only statically shaped memrefs.
// Similar to wmma-matmul-f32, but with the memref bare pointer lowering convention.
// This test also uses gpu.memcpy operations (instead of gpu.host_register).
// RUN: mlir-opt %s \
// RUN: | mlir-opt -test-lower-to-nvvm="host-bare-ptr-calling-convention=1 kernel-bare-ptr-calling-convention=1 cubin-chip=sm_70 cubin-format=%gpu_compilation_format" \
// RUN: | mlir-cpu-runner \
// RUN: --shared-libs=%mlir_cuda_runtime \
// RUN: --entry-point-result=void \
// RUN: | FileCheck %s
//mlir-opt /mnt/d/LLVM/NewPolygeistDir/llvm-project/mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32-bare-ptr.mlir | mlir-opt -test-lower-to-nvvm="host-bare-ptr-calling-convention=1 kernel-bare-ptr-calling-convention=1 cubin-chip=sm_75 cubin-format=fatbin" | mlir-cpu-runner --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_cuda_runtime.so --shared-libs=/mnt/d/LLVM/NewPolygeistDir/llvm-project/build/lib/libmlir_runner_utils.so --entry-point-result=void
func.func @main() {
// Allocate memory for input matrix of 16x16 half-precision floats
%h0 = memref.alloc() : memref<16x16xf16>
// Allocate memory for output matrix of 16x16 single-precision floats
%h_out = memref.alloc() : memref<16x16xf32>
// Define constants used in the program
%f1 = arith.constant 1.0e+00 : f16 // Constant value 1.0 of type half-precision float
%f0 = arith.constant 0.0e+00 : f32 // Constant value 0.0 of type single-precision float
%c0 = arith.constant 0 : index // Constant value 0 of type index
%c16 = arith.constant 16 : index // Constant value 16 of type index
%c32 = arith.constant 32 : index // Constant value 32 of type index
%c1 = arith.constant 1 : index // Constant value 1 of type index
// Initialize the input matrix with ones
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %f1, %h0[%arg0, %arg1] : memref<16x16xf16>
}
}
// Initialize the accumulator matrix with zeros
scf.for %arg0 = %c0 to %c16 step %c1 {
scf.for %arg1 = %c0 to %c16 step %c1 {
memref.store %f0, %h_out[%arg0, %arg1] : memref<16x16xf32>
}
}
// Asynchronous operations token
%token = gpu.wait async
// Allocate device memory for input matrix asynchronously
%0, %t0 = gpu.alloc async [%token] () : memref<16x16xf16>
// Allocate device memory for output matrix asynchronously
%out, %t1 = gpu.alloc async [%token]() : memref<16x16xf32>
// Copy input matrix from host to device asynchronously
%x = gpu.memcpy async [%token] %0, %h0 : memref<16x16xf16>, memref<16x16xf16>
// Copy output matrix from host to device asynchronously
%y = gpu.memcpy async [%token] %out, %h_out : memref<16x16xf32>, memref<16x16xf32>
// Launch GPU kernel.
// Grid (1, 1, 1), block (32, 1, 1): one block of 32 threads, i.e. one warp.
gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
threads(%tx, %ty, %tz) in (%block_x = %c32, %block_y = %c1, %block_z = %c1) {
// Load input matrix A into a subgroup MMA matrix of type AOp
// load from 0, 0, with lead dim 16 so 16 cols, 16 * 16
%A = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
// Load input matrix B into a subgroup MMA matrix of type BOp
// load from 0, 0, with lead dim 16 so 16 cols, 16 * 16
%B = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
// Load output matrix C into a subgroup MMA matrix of type COp
// load from 0, 0, with lead dim 16 so 16 cols, 16 * 16
%C = gpu.subgroup_mma_load_matrix %out[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
// Perform matrix multiplication and accumulation using MMA
%R = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
// Store the result back into the output matrix
gpu.subgroup_mma_store_matrix %R, %out[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
// Print success message to indicate successful execution
gpu.printf "Success\n"
// CHECK: Success
gpu.terminator
}
// Copy results back to host
%z = gpu.memcpy async [%token] %out, %h_out : memref<16x16xf32>, memref<16x16xf32>
// Deallocate device memory for input matrix asynchronously
%zz = gpu.dealloc async [%token] %0 : memref<16x16xf16>
// Deallocate device memory for output matrix asynchronously
%w = gpu.dealloc async [%token] %out : memref<16x16xf32>
// Wait for all asynchronous operations to complete
gpu.wait [%token]
// Print the memref on the host after computation.
call @printMemrefF32(%h_out) : (memref<16x16xf32>) -> ()
return
}
func.func private @printMemrefF32(memref<16x16xf32>)
//func.func private @printMemrefF32(memref<*xf32>)
```
</pre>