[cfe-dev] [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Bill Wendling via cfe-dev cfe-dev at lists.llvm.org
Tue May 26 17:16:29 PDT 2020


On Tue, May 26, 2020 at 4:38 PM Craig Topper <craig.topper at gmail.com> wrote:
>
> When I spent some time looking at this back in March, after Bill mentioned it on IRC, I think I saw a write to one bit in one of the 8-bit pieces and then a read of that bit plus a bit from the adjacent byte. So we used a narrow store and then a wider load because the read spanned two bytes.
>
I think you're referring to same_flow and free in the structure below.
Those both have stores, as does most of the rest of the bitfield (it's
an initialization, which seems like it could be done with a few bitwise
operations on the whole bitfield, but I digress). But yes, in the case
where we have consecutive accesses of bitfields in adjacent bytes, a
wider load and store are better.

> ~Craig
>
>
> On Tue, May 26, 2020 at 4:32 PM John McCall via cfe-dev <cfe-dev at lists.llvm.org> wrote:
>>
>>
>>
>> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>>
>> > We're running into an interesting issue with the Linux kernel, and
>> > wanted advice on how to proceed.
>> >
>> > Example of what we're talking about: https://godbolt.org/z/ABGySq
>> >
>> > The issue is that when working with a bitfield a load may happen
>> > quickly after a store. For instance:
>> >
>> > struct napi_gro_cb {
>> >     void *frag;
>> >     unsigned int frag_len;
>> >     u16 flush;
>> >     u16 flush_id;
>> >     u16 count;
>> >     u16 gro_remcsum_start;
>> >     unsigned long age;
>> >     u16 proto;
>> >     u8 same_flow : 1;
>> >     u8 encap_mark : 1;
>> >     u8 csum_valid : 1;
>> >     u8 csum_cnt : 3;
>> >     u8 free : 2;
>> >     u8 is_ipv6 : 1;
>> >     u8 is_fou : 1;
>> >     u8 is_atomic : 1;
>> >     u8 recursive_counter : 4;
>> >     __wsum csum;
>> >     struct sk_buff *last;
>> > };
>> >
>> > void dev_gro_receive(struct sk_buff *skb)
>> > {
>> > ...
>> >         same_flow = NAPI_GRO_CB(skb)->same_flow;
>> > ...
>> > }
>> >
>> > Right before the "same_flow = ... ->same_flow;" statement is executed,
>> > a store is made to the bitfield at the end of a called function:
>> >
>> >     NAPI_GRO_CB(skb)->same_flow = 1;
>> >
>> > The store is a byte:
>> >
>> >     orb    $0x1,0x4a(%rbx)
>> >
>> > while the read is a word:
>> >
>> >     movzwl 0x4a(%r12),%r15d
>> >
>> > The problem is that between the store and the load the value hasn't
>> > been retired / placed in the cache. One would expect store-to-load
>> > forwarding to kick in, but on x86 that doesn't happen because x86
>> > requires the store to be of equal or greater size than the load. So
>> > instead the load takes the slow path, causing unacceptable slowdowns.
>> >
>> > GCC gets around this by using the smallest load for a bitfield. It
>> > seems to use a byte for everything, at least in our examples. This is
>> > intentional: according to the comments (which are never wrong), C++0x
>> > doesn't allow one to touch bits outside of the bitfield. (I'm not a
>> > language lawyer, but I take this to mean that gcc is trying to
>> > minimize which bits it accesses by using byte stores and loads
>> > whenever possible.)
>> >
>> > The question I have is what should we do to fix this issue? Once we
>> > get to LLVM IR, the information saying that we're accessing a bitfield
>> > is gone. We have a few options:
>> >
>> > * We can glean this information from how the loaded value is used and
>> > fix this during DAG combine, but it seems a bit brittle.
>> >
>> > * We could correct the size of the load during the front-end's code
>> > generation. This benefits from using all of LLVM's passes on the code.
>> >
>> > * We could perform the transformation in another place, possibly in
>> > MIR or MC.
>> >
>> > What do people think?
>>
>> Clang used to generate narrower loads and stores for bit-fields, but a
>> long time ago it was intentionally changed to generate wider loads
>> and stores, IIRC by Chandler.  There are some cases where I think the
>> “new” code goes overboard, but in this case I don’t particularly have
>> an issue with the wider loads and stores.  I guess we could make a
>> best-effort attempt to stick to the storage-unit size when the
>> bit-fields break evenly on a boundary.  But mostly I think the
>> frontend’s responsibility ends with it generating same-size accesses
>> in both places, and if inconsistent access sizes trigger poor
>> performance, the backend should be more careful about intentionally
>> changing access sizes.

Fair enough.

-bw
