<table border="1" cellspacing="0" cellpadding="8">
    <tr>
        <th>Issue</th>
        <td>
            <a href=https://github.com/llvm/llvm-project/issues/97945>97945</a>
        </td>
    </tr>

    <tr>
        <th>Summary</th>
        <td>
            [arm/aarch64] Should LD4 be used to load multiple (vector) constants at once?
        </td>
    </tr>

    <tr>
      <th>Labels</th>
      <td>
            new issue
      </td>
    </tr>

    <tr>
      <th>Assignees</th>
      <td>
      </td>
    </tr>

    <tr>
      <th>Reporter</th>
      <td>
          Validark
      </td>
    </tr>
</table>

<pre>
    Consider the following code:

```zig
const std = @import("std");

export fn foo(indices: @Vector(16, u8)) @Vector(16, u8) {
    const iota: [64]u8 = std.simd.iota(u8, 64); // counts from 0 to 63 inclusive
    return tbl4(iota[0..16].*, iota[16..32].*, iota[32..48].*, iota[48..64].*, indices);
}

fn tbl4(table_part_1: @Vector(16, u8), table_part_2: @Vector(16, u8), table_part_3: @Vector(16, u8), table_part_4: @Vector(16, u8), indices: @Vector(16, u8)) @TypeOf(indices) {
    return struct {
        extern fn @"llvm.aarch64.neon.tbl4"(@TypeOf(table_part_1), @TypeOf(table_part_2), @TypeOf(table_part_3), @TypeOf(table_part_4), @TypeOf(indices)) @TypeOf(indices);
    }.@"llvm.aarch64.neon.tbl4"(table_part_1, table_part_2, table_part_3, table_part_4, indices);
}
```

Here is the emitted assembly:

```asm
.LCPI0_0:
        .byte   48
        ...
        .byte   63
.LCPI0_1:
        .byte   32
        ...
        .byte   47
.LCPI0_2:
        .byte   16
        ...
        .byte   31
.LCPI0_3:
        .byte   0
        ...
        .byte   15
foo:
        adrp    x8, .LCPI0_0
        ldr     q4, [x8, :lo12:.LCPI0_0]
        adrp    x8, .LCPI0_1
        ldr     q3, [x8, :lo12:.LCPI0_1]
        adrp    x8, .LCPI0_2
        ldr     q2, [x8, :lo12:.LCPI0_2]
        adrp    x8, .LCPI0_3
        ldr     q1, [x8, :lo12:.LCPI0_3]
        tbl     v0.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
        ret
```

I am not sure if this is correct aarch64 assembly, but couldn't we have instead done this?

```asm
.LCPI0_0:
        .byte   0
        .byte   16
        .byte   32
        .byte   48
        ... // remaining bytes, stored interleaved so that ld4 de-interleaves them
foo:
        adrp    x8, .LCPI0_0
        add     x8, x8, :lo12:.LCPI0_0  // ld4 has no immediate-offset form, so form the full address first
        ld4     { v1.16b, v2.16b, v3.16b, v4.16b }, [x8]
        tbl     v0.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
        ret
```
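
The proposal depends on the 64 constant bytes being stored interleaved in memory, so that `ld4` de-interleaves them back into the four table registers. As a sanity check on that layout, here is a small Python model. It is only a sketch of the byte ordering, not NEON code, and the helper names `interleave4` and `ld4_16b` are made up for illustration.

```python
# Plain-Python model of the proposed constant-pool layout.
# `interleave4` and `ld4_16b` are hypothetical helpers, not real APIs.

def interleave4(parts):
    """Lay out four 16-byte tables as p0[0], p1[0], p2[0], p3[0], p0[1], ..."""
    assert len(parts) == 4 and all(len(p) == 16 for p in parts)
    return [parts[r][i] for i in range(16) for r in range(4)]

def ld4_16b(memory):
    """Model a 4-register, 16-byte-lane ld4: register r gets bytes r, r+4, r+8, ..."""
    assert len(memory) == 64
    return [memory[r::4] for r in range(4)]

iota = list(range(64))                                   # 0..63, as in the Zig example
parts = [iota[16 * r:16 * (r + 1)] for r in range(4)]    # the four tbl4 table operands
pool = interleave4(parts)                                # what the compiler would emit as .byte data
regs = ld4_16b(pool)                                     # what v1..v4 would hold after the load

print(pool[:8])        # [0, 16, 32, 48, 1, 17, 33, 49]
print(regs == parts)   # True
```

The first bytes of `pool` match the `.byte 0, 16, 32, 48` sequence above, and the modeled load recovers the original four tables in order.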

(The particular constants used in this code are just for demonstration purposes. Obviously we could do better in this case by zeroing out indices higher than 63 with a `cmhi` followed by a `bic`.)
</pre>