efriedma-quic wrote:

Actually, I guess the following is the shortest, at 2 instructions:

```
uint8x8_t load_3byte_insert_byte(char* a) {
    return vld1_lane_s8(a+2, vld1_dup_u16(a), 2);
}
```

https://github.com/llvm/llvm-project/pull/78632