[cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

Tue May 26 15:54:20 PDT 2020

By default, clang emits all bitfield load/store operations using the width of the entire sequence of bitfield members.  If you look at the LLVM IR for your testcase, all the bitfield operations are i16.  (For thread safety, the C/C++ standards treat a sequence of bitfield members as a single "field".)

If you look at the assembly, though, an "andb $-2, (%rdi)" slips in.  This is specific to the x86 backend: it's narrowing the store to save a couple bytes in the encoding, and a potential decoding stall due to a 2-byte immediate.  Maybe we shouldn't do that, or we should guard it with a better heuristic.

-Eli

-----Original Message-----
From: cfe-dev <cfe-dev-bounces at lists.llvm.org> On Behalf Of Bill Wendling via cfe-dev
Sent: Tuesday, May 26, 2020 3:29 PM
To: cfe-dev at lists.llvm.org Developers <cfe-dev at lists.llvm.org>; llvm-dev <llvm-dev at lists.llvm.org>
Subject: [EXT] [cfe-dev] [RFC] Loading Bitfields with Smallest Needed Types

We're running into an interesting issue with the Linux kernel, and
wanted advice on how to proceed.

Example of what we're talking about: https://godbolt.org/z/ABGySq

The issue is that when working with a bitfield a load may happen
quickly after a store. For instance:

struct napi_gro_cb {
    void *frag;
    unsigned int frag_len;
    u16 flush;
    u16 flush_id;
    u16 count;
    u16 gro_remcsum_start;
    unsigned long age;
    u16 proto;
    u8 same_flow : 1;
    u8 encap_mark : 1;
    u8 csum_valid : 1;
    u8 csum_cnt : 3;
    u8 free : 2;
    u8 is_ipv6 : 1;
    u8 is_fou : 1;
    u8 is_atomic : 1;
    u8 recursive_counter : 4;
    __wsum csum;
    struct sk_buff *last;
};

void dev_gro_receive(struct sk_buff *skb)
{
...
        same_flow = NAPI_GRO_CB(skb)->same_flow;
...
}

Right before the "same_flow = ... ->same_flow;" statement is executed,
a store is made to the bitfield at the end of a called function:

    NAPI_GRO_CB(skb)->same_flow = 1;

The store is a byte:

    orb    $0x1,0x4a(%rbx)

while the read is a word:

    movzwl 0x4a(%r12),%r15d

The problem is that between the store and the load the value hasn't
been retired / placed in the cache. One would expect store-to-load
forwarding to kick in, but on x86 that doesn't happen because x86
requires the store to be of equal or greater size than the load. So
instead the load takes the slow path, causing unacceptable slowdowns.

GCC gets around this by using the smallest load for a bitfield. It
seems to use a byte for everything, at least in our examples. From the
comments, this is intentional, because according to the comments
(which are never wrong) C++0x doesn't allow one to touch bits outside
of the bitfield. (I'm not a language lawyer, but take this to mean
that gcc is trying to minimize which bits it's accessing by using byte
stores and loads whenever possible.)

The question I have is what should we do to fix this issue? Once we
get to LLVM IR, the information saying that we're accessing a bitfield
is gone. We have a few options:

* We can glean this information from how the loaded value is used and
fix this during DAG combine, but it seems a bit brittle.

* We could correct the size of the load during the front-end's code
generation. This benefits from using all of LLVM's passes on the code.

* We could perform the transformation in another place, possible in MIR or MC.

What do people think?

-bw
_______________________________________________
cfe-dev mailing list
cfe-dev at lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev