[PATCH] D61810: [BPF] add new intrinsics preserve_{array,union,struct}_access_index
Yonghong Song via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon May 20 12:27:08 PDT 2019
yonghong-song added a comment.
I did some analysis on the performance side, specifically, # of instructions generated.
My example is cilium at https://github.com/cilium/cilium/tree/master/bpf, bpf program bpf_lxc.c.
(1). First, I made the following change:
-bash-4.4$ git diff
diff --git a/include/llvm/Analysis/TargetTransformInfoImpl.h b/include/llvm/Analysis/TargetTransformInfoImpl.h
index a1e1f9b07aa..544bbc93a4f 100644
- a/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -781,6 +781,9 @@ public:
case Intrinsic::coro_suspend:
case Intrinsic::coro_param:
case Intrinsic::coro_subfn_addr:
+ case Intrinsic::preserve_array_access_index:
+ case Intrinsic::preserve_union_access_index:
+ case Intrinsic::preserve_struct_access_index:
// These intrinsics don't actually represent code after lowering.
return TTI::TCC_Free;
}
-bash-4.4$
so all previously inlined functions are all inlined. There is probably a better way to do this than
Just TTI::TCC_Free. The change here is just for demonstration purpose.
(2). Make the following change for cilium so the data array is aligned to 8 to
use store double always.
diff --git a/bpf/lib/icmp6.h b/bpf/lib/icmp6.h
index fcae3d3ed..bb397685d 100644
- a/bpf/lib/icmp6.h
+++ b/bpf/lib/icmp6.h
@@ -224,7 +224,7 @@ static inline __be32 compute_icmp6_csum(char data[80], __u16 payload_len,
static inline int __icmp6_send_time_exceeded(struct __sk_buff *skb, int nh_off)
{
/* FIXME: Fix code below to not require this init */
- char data[80] = {};
+ char data[80] __attribute__((aligned(8))) = {};
struct icmp6hdr *icmp6hoplim;
struct ipv6hdr *ipv6hdr;
char *upper; /* icmp6 or tcp or udp */
(3).
Compile bpf_lxc.c with and without offset relocation, and compare the number of instructions.
Note that bpf_lxc.c does not have bpf_probe_read, so here we purely measure the cost
of preserve_*_access_index in the generated codes.
The function sizes without offset reloc:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC
826: 0000000000000000 336 FUNC GLOBAL DEFAULT 3 __send_drop_notify
829: 0000000000000000 656 FUNC GLOBAL DEFAULT 27 handle_policy
830: 0000000000000000 944 FUNC GLOBAL DEFAULT 29 handle_to_container
831: 0000000000000000 720 FUNC GLOBAL DEFAULT 17 handle_xgress
832: 0000000000000000 1880 FUNC GLOBAL DEFAULT 15 tail_handle_arp
833: 0000000000000000 28264 FUNC GLOBAL DEFAULT 13 tail_handle_ipv4
834: 0000000000000000 29800 FUNC GLOBAL DEFAULT 11 tail_handle_ipv6
835: 0000000000000000 3272 FUNC GLOBAL DEFAULT 9 tail_icmp6_handle_ns
836: 0000000000000000 2480 FUNC GLOBAL DEFAULT 5 tail_icmp6_send_echo_reply
837: 0000000000000000 3968 FUNC GLOBAL DEFAULT 7 tail_icmp6_send_time_exceeded
838: 0000000000000000 11824 FUNC GLOBAL DEFAULT 23 tail_ipv4_policy
839: 0000000000000000 12568 FUNC GLOBAL DEFAULT 25 tail_ipv4_to_endpoint
840: 0000000000000000 14144 FUNC GLOBAL DEFAULT 19 tail_ipv6_policy
841: 0000000000000000 15328 FUNC GLOBAL DEFAULT 21 tail_ipv6_to_endpoint
The function sizes with offset reloc enabled:
-bash-4.4$ llvm-readelf -s bpf_lxc.o | grep FUNC
851: 0000000000000000 336 FUNC GLOBAL DEFAULT 3 __send_drop_notify
854: 0000000000000000 680 FUNC GLOBAL DEFAULT 27 handle_policy
855: 0000000000000000 960 FUNC GLOBAL DEFAULT 29 handle_to_container
856: 0000000000000000 744 FUNC GLOBAL DEFAULT 17 handle_xgress
857: 0000000000000000 1904 FUNC GLOBAL DEFAULT 15 tail_handle_arp
858: 0000000000000000 29048 FUNC GLOBAL DEFAULT 13 tail_handle_ipv4
859: 0000000000000000 31808 FUNC GLOBAL DEFAULT 11 tail_handle_ipv6
860: 0000000000000000 3280 FUNC GLOBAL DEFAULT 9 tail_icmp6_handle_ns
861: 0000000000000000 2528 FUNC GLOBAL DEFAULT 5 tail_icmp6_send_echo_reply
862: 0000000000000000 4056 FUNC GLOBAL DEFAULT 7 tail_icmp6_send_time_exceeded
863: 0000000000000000 12224 FUNC GLOBAL DEFAULT 23 tail_ipv4_policy
864: 0000000000000000 13088 FUNC GLOBAL DEFAULT 25 tail_ipv4_to_endpoint
865: 0000000000000000 14792 FUNC GLOBAL DEFAULT 19 tail_ipv6_policy
866: 0000000000000000 15992 FUNC GLOBAL DEFAULT 21 tail_ipv6_to_endpoint
In summary, with relocable gep, we get 0-7% regression.
Look at function handle_policy(),
good code:
44: b7 02 00 00 00 00 00 00 r2 = 0
45: 7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
46: 7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
47: b7 02 00 00 00 01 00 00 r2 = 256
48: 7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
49: 87 01 00 00 00 00 00 00 r1 = -r1
50: 73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
51: bf a2 00 00 00 00 00 00 r2 = r10
52: 07 02 00 00 e8 ff ff ff r2 += -24
53: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
55: 85 00 00 00 01 00 00 00 call 1
regression:
44: b7 02 00 00 00 00 00 00 r2 = 0
45: 7b 2a f8 ff 00 00 00 00 *(u64 *)(r10 - 8) = r2
46: 7b 2a f0 ff 00 00 00 00 *(u64 *)(r10 - 16) = r2
47: 7b 2a e8 ff 00 00 00 00 *(u64 *)(r10 - 24) = r2
48: 87 01 00 00 00 00 00 00 r1 = -r1
49: 73 1a e8 ff 00 00 00 00 *(u8 *)(r10 - 24) = r1
50: 71 a1 e9 ff 00 00 00 00 r1 = *(u8 *)(r10 - 23)
51: 57 01 00 00 fc 00 00 00 r1 &= 252
52: 47 01 00 00 01 00 00 00 r1 |= 1
53: 73 1a e9 ff 00 00 00 00 *(u8 *)(r10 - 23) = r1
54: bf a2 00 00 00 00 00 00 r2 = r10
55: 07 02 00 00 e8 ff ff ff r2 += -24
56: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
58: 85 00 00 00 01 00 00 00 call 1
Look at function tail_handle_ipv6(),
good code:
49: 85 00 00 00 19 00 00 00 call 25
50: 15 07 0e 00 80 00 00 00 if r7 == 128 goto +14 <LBB4_7>
51: 55 07 54 00 87 00 00 00 if r7 != 135 goto +84 <LBB4_10>
52: b7 01 00 00 02 00 00 00 r1 = 2
53: 63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
54: b7 01 00 00 0e 00 00 00 r1 = 14
55: 63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
56: bf 61 00 00 00 00 00 00 r1 = r6
57: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
59: b7 03 00 00 04 00 00 00 r3 = 4
regression:
49: 85 00 00 00 19 00 00 00 call 25
50: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
52: 63 1a a4 ff 00 00 00 00 *(u32 *)(r10 - 92) = r1
53: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
55: 63 1a a0 ff 00 00 00 00 *(u32 *)(r10 - 96) = r1
56: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
58: 63 1a 9c ff 00 00 00 00 *(u32 *)(r10 - 100) = r1
59: 18 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r1 = 0 ll
61: 63 1a 98 ff 00 00 00 00 *(u32 *)(r10 - 104) = r1
62: 15 08 0e 00 80 00 00 00 if r8 == 128 goto +14 <LBB4_6>
63: 55 08 46 00 87 00 00 00 if r8 != 135 goto +70 <LBB4_10>
64: b7 01 00 00 02 00 00 00 r1 = 2
65: 63 16 34 00 00 00 00 00 *(u32 *)(r6 + 52) = r1
66: b7 01 00 00 0e 00 00 00 r1 = 14
67: 63 16 30 00 00 00 00 00 *(u32 *)(r6 + 48) = r1
68: bf 61 00 00 00 00 00 00 r1 = r6
69: 18 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 r2 = 0 ll
71: b7 03 00 00 04 00 00 00 r3 = 4
Maybe some pathes makes conservative decisions in the case of
intrinsic calls, even if it is marked as no memory access.
I will proceed to do the following:
. considering how to limit the scope of using intrinsics. we may introduce
another one for explicit buy-in intrinsic to identify the places where
preserve_*_access_index is needed. Note that for a typicall bpf program,
in most places, preserve_*_access_index intrinsics are not needed.
. if we proceed with this approach, I will also look at the # of insn regressions
so we can still have reasonable performance even if intrinsics are used.
Repository:
rL LLVM
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D61810/new/
https://reviews.llvm.org/D61810
More information about the llvm-commits
mailing list