[llvm] [HWASAN] Use sign extension in memToShadow() and untagPointer() (PR #103727)

Wed Aug 28 18:33:45 PDT 2024

SiFiveHolland wrote:

Here are some benchmark numbers:

<details>
<summary>CoreMark results on Cortex-A76 (RK3588)</summary>

Baseline:
```
CoreMark 1.0 : 11116.051578 / Android (12085363, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11116.051578 / Android (12085363, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11111.111111 / Android (12085363, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11125.945705 / Android (12085363, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11123.470523 / Android (12085363, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

 Performance counter stats for './baseline/libc.so ./baseline/coremark.exe' (5 runs):

         27,979.84 msec task-clock                       #    1.000 CPUs utilized            ( +-  0.02% )
               846      context-switches                 #   30.231 /sec                     ( +-  2.61% )
                 0      cpu-migrations                   #    0.000 /sec                   
             2,710      page-faults                      #   96.840 /sec                   
    63,707,649,465      cycles                           #    2.277 GHz                      ( +-  0.02% )
   211,110,761,288      instructions                     #    3.31  insn per cycle           ( +-  0.00% )
    45,538,007,784      branches                         #    1.627 G/sec                    ( +-  0.00% )
        98,909,714      branch-misses                    #    0.22% of all branches          ( +-  0.28% )

          27.98656 +- 0.00512 seconds time elapsed  ( +-  0.02% )

```

With this patch:
```
CoreMark 1.0 : 11060.723371 / Android (dev, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11039.355302 / Android (dev, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11074.197121 / Android (dev, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11073.583965 / Android (dev, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap
CoreMark 1.0 : 11074.197121 / Android (dev, +pgo, +bolt, +lto, +mlgo, based on r530567) Clang 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) -O2 -DPERFORMANCE_RUN=1  -lrt / Heap

 Performance counter stats for './baseline/libc.so ./custom/coremark.exe' (5 runs):

         28,102.82 msec task-clock                       #    0.999 CPUs utilized            ( +-  0.06% )
               849      context-switches                 #   30.188 /sec                     ( +-  0.05% )
                 0      cpu-migrations                   #    0.000 /sec                   
             2,708      page-faults                      #   96.288 /sec                   
    64,007,014,069      cycles                           #    2.276 GHz                      ( +-  0.06% )
   213,868,614,685      instructions                     #    3.34  insn per cycle           ( +-  0.00% )
    45,538,085,229      branches                         #    1.619 G/sec                    ( +-  0.00% )
       102,999,597      branch-misses                    #    0.23% of all branches          ( +-  1.86% )

           28.1261 +- 0.0161 seconds time elapsed  ( +-  0.06% )

```

</details>

It looks like there's a small (<0.5%) performance decrease, with about 1.3% more instructions executed (213.9 billion vs. 211.1 billion). This appears to be caused by a loop optimization that keeps the intermediate shift result in a register and increments it along with the loop control variable, apparently because it overestimates the cost of the shifts relative to maintaining another loop variable. For example, this affects `matrix_mul_vect` in CoreMark. In the snippet below, `x10` and `x11` are completely unnecessary: they only hold the pre-shifted pointer values, and each is bumped by `#0x200` on every iteration.

```
190:   d377db8b        lsl     x11, x28, #9
194:   d37ffb89        lsl     x9, x28, #1
198:   cb1c030c        sub     x12, x24, x28
19c:   aa0303ed        mov     x13, x3
1a0:   aa0203ee        mov     x14, x2
1a4:   8b02216a        add     x10, x11, x2, lsl #8
1a8:   8b03216b        add     x11, x11, x3, lsl #8
1ac:   934cfd4f        asr     x15, x10, #12 <<<<<<< could be "x15, x2, #4, #52"
1b0:   8b0e0130        add     x16, x9, x14
1b4:   8b090040        add     x0, x2, x9
1b8:   d378fe10        lsr     x16, x16, #56
1bc:   386f6a8f        ldrb    w15, [x20, x15]
1c0:   6b0f021f        cmp     w16, w15
1c4:   54000281        b.ne    214 <matrix_mul_vect+0x214>  // b.any
1c8:   934cfd6f        asr     x15, x11, #12 <<<<<<< could be "x15, x3, #4, #52"
1cc:   8b0d0130        add     x16, x9, x13
1d0:   d378fe10        lsr     x16, x16, #56
1d4:   386f6a91        ldrb    w17, [x20, x15]
1d8:   79c0000f        ldrsh   w15, [x0]
1dc:   8b090060        add     x0, x3, x9
1e0:   6b11021f        cmp     w16, w17
1e4:   540001c1        b.ne    21c <matrix_mul_vect+0x21c>  // b.any
1e8:   79c00010        ldrsh   w16, [x0]
1ec:   91000842        add     x2, x2, #0x2
1f0:   9108014a        add     x10, x10, #0x200
1f4:   910009ce        add     x14, x14, #0x2
1f8:   f100058c        subs    x12, x12, #0x1
1fc:   91000863        add     x3, x3, #0x2
200:   1b0f2208        madd    w8, w16, w15, w8
204:   9108016b        add     x11, x11, #0x200
208:   910009ad        add     x13, x13, #0x2
20c:   54fffd01        b.ne    1ac <matrix_mul_vect+0x1ac>  // b.any
210:   17ffffd4        b       160 <matrix_mul_vect+0x160>
```
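To make the "could be" annotations concrete, here is a minimal sketch of the arithmetic involved. It is not the code from this patch. The function and constant names are hypothetical, the shift amounts are read off the disassembly (`lsl #8` followed by `asr #12`), and I am assuming the annotation refers to a signed bitfield extract (`sbfx`), since the mnemonic is omitted. The sketch also assumes `>>` on a negative `int64_t` is an arithmetic shift, which C++20 guarantees and mainstream compilers have always provided.

```cpp
#include <cstdint>

// Hypothetical constants, inferred from the shift amounts in the listing.
constexpr unsigned kTagBits = 8;      // lsl #8 discards the top-byte tag
constexpr unsigned kShadowScale = 4;  // asr #12 == tag bits + log2(16-byte granule)

// Two-shift form: this is the value the loop keeps pre-shifted in x10/x11.
// Shift the tag out of the top byte, then arithmetic-shift right so that
// bit 55 of the original pointer becomes the sign of the shadow offset.
constexpr int64_t shadowOffsetTwoShifts(uint64_t ptr) {
  return static_cast<int64_t>(ptr << kTagBits) >> (kTagBits + kShadowScale);
}

// Single-extract form: take the 52-bit field at bits [55:4] and sign-extend
// it from its top bit. This is what one signed bitfield extract with
// lsb = 4, width = 52 computes.
constexpr int64_t shadowOffsetBitfield(uint64_t ptr) {
  constexpr unsigned kWidth = 64 - kTagBits - kShadowScale;  // 52
  constexpr uint64_t kMask = (uint64_t{1} << kWidth) - 1;
  constexpr uint64_t kSign = uint64_t{1} << (kWidth - 1);
  const uint64_t field = (ptr >> kShadowScale) & kMask;
  // Classic sign-extension idiom: flip the sign bit, then subtract it.
  return static_cast<int64_t>((field ^ kSign) - kSign);
}
```

Both functions return the same offset for any input, so the value can be recomputed from the untouched pointer in one instruction per use instead of carrying a second, pre-shifted induction variable through the loop.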

https://github.com/llvm/llvm-project/pull/103727