[llvm] [BPF] expand mem intrinsics (memcpy, memmove, memset) (PR #97648)

via llvm-commits llvm-commits at lists.llvm.org
Fri Jul 5 09:22:31 PDT 2024


yonghong-song wrote:

If the user code contains a call to memcpy that cannot be simply unrolled, the current behavior is to issue an error. For example:
```
$ cat test1.c
#include <stdint.h>
#include <string.h>
typedef struct {
    unsigned char x[8];
} buf_t;
void f(buf_t *buf, uint64_t y, uint64_t z) {
    if (z > 8) z = 8;
    unsigned char *y_bytes = (unsigned char *)&y;
    memcpy(buf->x, y_bytes, z);
}
```
With the current compiler (LLVM 18), we get:
```
/* I added gnu/stubs-32.h in the current directory so the compilation passes */
$ clang -O2 -target bpf -I . test1.c -c
test1.c:6:6: error: A call to built-in function 'memcpy' is not supported.
    6 | void f(buf_t *buf, uint64_t y, uint64_t z) {
      |      ^
1 error generated.
```
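For comparison, note that the error only fires when the copy cannot be fully unrolled. A hypothetical constant-size variant of test1.c (the `f_const` name below is mine, not from the PR) compiles fine today, because an 8-byte memcpy expands to straight-line loads/stores with no loop:

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned char x[8];
} buf_t;

/* Constant-size variant of test1.c: the 8-byte memcpy can be
 * expanded into a u64 load/store pair, so no loop is needed
 * and the BPF backend does not reject it. */
void f_const(buf_t *buf, uint64_t y) {
    unsigned char *y_bytes = (unsigned char *)&y;
    memcpy(buf->x, y_bytes, 8); /* length known at compile time */
}
```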

But with this patch, I see
```
0000000000000000 <f>:
       0:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 0x8) = r2
       1:       b7 04 00 00 08 00 00 00 r4 = 0x8
       2:       bf 32 00 00 00 00 00 00 r2 = r3
       3:       2d 34 01 00 00 00 00 00 if r4 > r3 goto +0x1 <f+0x28>
       4:       b7 02 00 00 08 00 00 00 r2 = 0x8
       5:       15 02 0b 00 00 00 00 00 if r2 == 0x0 goto +0xb <f+0x88>
       6:       b7 02 00 00 00 00 00 00 r2 = 0x0
       7:       bf 15 00 00 00 00 00 00 r5 = r1
       8:       0f 25 00 00 00 00 00 00 r5 += r2
       9:       bf a0 00 00 00 00 00 00 r0 = r10
      10:       07 00 00 00 f8 ff ff ff r0 += -0x8
      11:       0f 20 00 00 00 00 00 00 r0 += r2
      12:       71 00 00 00 00 00 00 00 r0 = *(u8 *)(r0 + 0x0)
      13:       73 05 00 00 00 00 00 00 *(u8 *)(r5 + 0x0) = r0
      14:       07 02 00 00 01 00 00 00 r2 += 0x1
      15:       3d 32 01 00 00 00 00 00 if r2 >= r3 goto +0x1 <f+0x88>
      16:       2d 24 f6 ff 00 00 00 00 if r4 > r2 goto -0xa <f+0x38>
      17:       95 00 00 00 00 00 00 00 exit
```
Basically, the memcpy is 'inlined' by the compiler as a byte-copy loop.

This is probably not what we want for test1.c, since the memcpy call comes from the user. In such cases, we would rather have the user write the loop explicitly for maximum performance (e.g., load/store the bulk with u64 and the remainder with u8).
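A minimal sketch of such a hand-written loop (the `f_explicit` name and the exact chunking strategy are my illustration, not part of the PR):

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned char x[8];
} buf_t;

/* Hand-written replacement for the memcpy() in test1.c:
 * handle the full 8-byte case with a single u64-sized copy,
 * and fall back to a u8 loop for the remaining short lengths. */
void f_explicit(buf_t *buf, uint64_t y, uint64_t z) {
    if (z > 8) z = 8;
    if (z == 8) {
        /* bulk case: constant-size copy lowers to one u64 store */
        __builtin_memcpy(buf->x, &y, 8);
    } else {
        /* remainder: copy the leading z bytes one u8 at a time */
        unsigned char *y_bytes = (unsigned char *)&y;
        for (uint64_t i = 0; i < z; i++)
            buf->x[i] = y_bytes[i];
    }
}
```

Written this way, every copy has either a constant size or an explicit, verifier-friendly bounded loop, so the user controls the code shape rather than the compiler's generic expansion.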

Maybe we should somehow prevent memcpy expansion when it cannot be fully unrolled later (i.e., when it would require a loop)? We have some backend hooks at the IR level; maybe we can leverage them?



https://github.com/llvm/llvm-project/pull/97648

