[PATCH] D61810: [BPF] add new intrinsics preserve_{array,union,struct}_access_index

Yonghong Song via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon May 20 12:27:08 PDT 2019


yonghong-song added a comment.

I did some analysis on the performance side, specifically, # of instructions generated.
My example is cilium at https://github.com/cilium/cilium/tree/master/bpf, bpf program bpf_lxc.c.

(1). First, I made the following change:
-bash-4.4$ git diff
diff --git a/include/llvm/Analysis/TargetTransformInfoImpl.h b/include/llvm/Analysis/TargetTransformInfoImpl.h
index a1e1f9b07aa..544bbc93a4f 100644

- a/include/llvm/Analysis/TargetTransformInfoImpl.h

+++ b/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -781,6 +781,9 @@ public:

  case Intrinsic::coro_suspend:
  case Intrinsic::coro_param:
  case Intrinsic::coro_subfn_addr:

+    case Intrinsic::preserve_array_access_index:
+    case Intrinsic::preserve_union_access_index:
+    case Intrinsic::preserve_struct_access_index:

    // These intrinsics don't actually represent code after lowering.
    return TTI::TCC_Free;
  }

-bash-4.4$
so all previously inlined functions are all inlined. There is probably a better way to do this than
Just TTI::TCC_Free. The change here is just for demonstration purpose.

(2). Make the following change for cilium so the data array is aligned to 8 to
use store double always.

diff --git a/bpf/lib/icmp6.h b/bpf/lib/icmp6.h
index fcae3d3ed..bb397685d 100644

- a/bpf/lib/icmp6.h

+++ b/bpf/lib/icmp6.h
@@ -224,7 +224,7 @@ static inline __be32 compute_icmp6_csum(char data[80], __u16 payload_len,
 static inline int __icmp6_send_time_exceeded(struct __sk_buff *skb, int nh_off)
 {

  /* FIXME: Fix code below to not require this init */

- char data[80] = {};

+        char data[80] __attribute__((aligned(8))) = {};

   struct icmp6hdr *icmp6hoplim;
   struct ipv6hdr *ipv6hdr;
  char *upper; /* icmp6 or tcp or udp */

(3).
Compile bpf_lxc.c with and without offset relocation, and compare the number of instructions.
Note that bpf_lxc.c does not have bpf_probe_read, so here we purely measure the cost
of preserve_*_access_index in the generated codes.

The function sizes without offset reloc:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC

  826: 0000000000000000   336 FUNC    GLOBAL DEFAULT    3 __send_drop_notify
  829: 0000000000000000   656 FUNC    GLOBAL DEFAULT   27 handle_policy
  830: 0000000000000000   944 FUNC    GLOBAL DEFAULT   29 handle_to_container
  831: 0000000000000000   720 FUNC    GLOBAL DEFAULT   17 handle_xgress
  832: 0000000000000000  1880 FUNC    GLOBAL DEFAULT   15 tail_handle_arp
  833: 0000000000000000 28264 FUNC    GLOBAL DEFAULT   13 tail_handle_ipv4
  834: 0000000000000000 29800 FUNC    GLOBAL DEFAULT   11 tail_handle_ipv6
  835: 0000000000000000  3272 FUNC    GLOBAL DEFAULT    9 tail_icmp6_handle_ns
  836: 0000000000000000  2480 FUNC    GLOBAL DEFAULT    5 tail_icmp6_send_echo_reply
  837: 0000000000000000  3968 FUNC    GLOBAL DEFAULT    7 tail_icmp6_send_time_exceeded
  838: 0000000000000000 11824 FUNC    GLOBAL DEFAULT   23 tail_ipv4_policy
  839: 0000000000000000 12568 FUNC    GLOBAL DEFAULT   25 tail_ipv4_to_endpoint
  840: 0000000000000000 14144 FUNC    GLOBAL DEFAULT   19 tail_ipv6_policy
  841: 0000000000000000 15328 FUNC    GLOBAL DEFAULT   21 tail_ipv6_to_endpoint

The function sizes with offset reloc enabled:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC

  851: 0000000000000000   336 FUNC    GLOBAL DEFAULT    3 __send_drop_notify
  854: 0000000000000000   680 FUNC    GLOBAL DEFAULT   27 handle_policy
  855: 0000000000000000   960 FUNC    GLOBAL DEFAULT   29 handle_to_container
  856: 0000000000000000   744 FUNC    GLOBAL DEFAULT   17 handle_xgress
  857: 0000000000000000  1904 FUNC    GLOBAL DEFAULT   15 tail_handle_arp
  858: 0000000000000000 29048 FUNC    GLOBAL DEFAULT   13 tail_handle_ipv4
  859: 0000000000000000 31808 FUNC    GLOBAL DEFAULT   11 tail_handle_ipv6
  860: 0000000000000000  3280 FUNC    GLOBAL DEFAULT    9 tail_icmp6_handle_ns
  861: 0000000000000000  2528 FUNC    GLOBAL DEFAULT    5 tail_icmp6_send_echo_reply
  862: 0000000000000000  4056 FUNC    GLOBAL DEFAULT    7 tail_icmp6_send_time_exceeded
  863: 0000000000000000 12224 FUNC    GLOBAL DEFAULT   23 tail_ipv4_policy
  864: 0000000000000000 13088 FUNC    GLOBAL DEFAULT   25 tail_ipv4_to_endpoint
  865: 0000000000000000 14792 FUNC    GLOBAL DEFAULT   19 tail_ipv6_policy
  866: 0000000000000000 15992 FUNC    GLOBAL DEFAULT   21 tail_ipv6_to_endpoint

In summary, with relocable gep, we get 0-7% regression.

Look at function handle_policy(),
good code:

  44:       b7 02 00 00 00 00 00 00 r2 = 0
  45:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
  46:       7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
  47:       b7 02 00 00 00 01 00 00 r2 = 256
  48:       7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
  49:       87 01 00 00 00 00 00 00 r1 = -r1
  50:       73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
  51:       bf a2 00 00 00 00 00 00 r2 = r10
  52:       07 02 00 00 e8 ff ff ff r2 += -24
  53:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  55:       85 00 00 00 01 00 00 00 call 1

regression:

  44:       b7 02 00 00 00 00 00 00 r2 = 0
  45:       7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
  46:       7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
  47:       7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
  48:       87 01 00 00 00 00 00 00 r1 = -r1
  49:       73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
  50:       71 a1 e9 ff 00 00 00 00 r1 = *(u8 *)(r10 - 23)
  51:       57 01 00 00 fc 00 00 00 r1 &= 252
  52:       47 01 00 00 01 00 00 00 r1 |= 1
  53:       73 1a e9 ff 00 00 00 00 *(u8 *)(r10 - 23) = r1
  54:       bf a2 00 00 00 00 00 00 r2 = r10
  55:       07 02 00 00 e8 ff ff ff r2 += -24
  56:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  58:       85 00 00 00 01 00 00 00 call 1

Look at function tail_handle_ipv6(),
good code:

  49:       85 00 00 00 19 00 00 00 call 25
  50:       15 07 0e 00 80 00 00 00 if r7 == 128 goto +14 <LBB4_7>
  51:       55 07 54 00 87 00 00 00 if r7 != 135 goto +84 <LBB4_10>
  52:       b7 01 00 00 02 00 00 00 r1 = 2
  53:       63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
  54:       b7 01 00 00 0e 00 00 00 r1 = 14
  55:       63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
  56:       bf 61 00 00 00 00 00 00 r1 = r6
  57:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
  59:       b7 03 00 00 04 00 00 00 r3 = 4

regression:

  49:       85 00 00 00 19 00 00 00 call 25
  50:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  52:       63 1a a4 ff 00 00 00 00 *(u32 *)(r10 - 92) = r1
  53:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  55:       63 1a a0 ff 00 00 00 00 *(u32 *)(r10 - 96) = r1
  56:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  58:       63 1a 9c ff 00 00 00 00 *(u32 *)(r10 - 100) = r1
  59:       18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
  61:       63 1a 98 ff 00 00 00 00 *(u32 *)(r10 - 104) = r1
  62:       15 08 0e 00 80 00 00 00 if r8 == 128 goto +14 <LBB4_6>
  63:       55 08 46 00 87 00 00 00 if r8 != 135 goto +70 <LBB4_10>
  64:       b7 01 00 00 02 00 00 00 r1 = 2
  65:       63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
  66:       b7 01 00 00 0e 00 00 00 r1 = 14
  67:       63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
  68:       bf 61 00 00 00 00 00 00 r1 = r6
  69:       18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
  71:       b7 03 00 00 04 00 00 00 r3 = 4

Maybe some pathes makes conservative decisions in the case of
intrinsic calls, even if it is marked as no memory access.

I will proceed to do the following:

  . considering how to limit the scope of using intrinsics. we may introduce
    another one for explicit buy-in intrinsic to identify the places where
    preserve_*_access_index is needed. Note that for a typicall bpf program,
    in most places, preserve_*_access_index intrinsics are not needed.
  . if we proceed with this approach, I will also look at the # of insn regressions
   so we can still have reasonable performance even if intrinsics are used.


Repository:
  rL LLVM

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D61810/new/

https://reviews.llvm.org/D61810





More information about the llvm-commits mailing list