[cfe-dev] [llvm-dev] [RFC] Loading Bitfields with Smallest Needed Types

Thu May 28 15:42:13 PDT 2020

On Tue, May 26, 2020 at 7:49 PM James Y Knight via llvm-dev
<llvm-dev at lists.llvm.org> wrote:
>
> At least in this test-case, the "bitfield" part of this seems to be a distraction. As Eli notes, Clang has lowered the function to LLVM IR containing consistent i16 operations. Despite that being a different choice from GCC, it should still be correct and consistent.
>
I suspect that this is more prevalent with bitfields as they're more
likely to have the load / bitwise op / store operations done on them,
resulting in an access type that can be shortened. But yes, it's not
specific to just bitfields.

I'm more interested in consistency, to be honest. If the loads and
stores for the bitfields (or other such shorten-able objects) were the
same, then we wouldn't run into the store-to-load forwarding issue on
x86 (I don't know about other platforms, but suspect that consistency
wouldn't hurt). I liked Arthur's idea of accessing the object using
the type size the bitfield was defined with (i8, i16, i256). It would
help with improving the heuristic. The downside is that it could lead
to un-optimal code, but that's the situation we have now, so...

-bw

> Of course that insight does mean it's quite easy to create a test-case with the exact same problematic store->load mismatch which doesn't use bit-fields at all. For example:
> short f2(short *bfs) {
>     *bfs &= ~0x1;
>     g();
>     return *bfs;
> }
>
> creates the same bad sequence:
>         movq    %rdi, %rbx
>         andb    $-2, (%rdi)
>         callq   g()
>         movzwl  (%rbx), %eax
>
> On Tue, May 26, 2020 at 9:30 PM John McCall via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>
>> On 26 May 2020, at 20:31, Arthur O'Dwyer wrote:
>>
>> On Tue, May 26, 2020 at 7:32 PM John McCall via cfe-dev <
>> cfe-dev at lists.llvm.org> wrote:
>>
>> On 26 May 2020, at 18:28, Bill Wendling via llvm-dev wrote:
>>
>> [...] The store is a byte:
>>
>> orb $0x1,0x4a(%rbx)
>>
>> while the read is a word:
>>
>> movzwl 0x4a(%r12),%r15d
>>
>> The problem is that between the store and the load the value hasn't
>> been retired / placed in the cache. One would expect store-to-load
>> forwarding to kick in, but on x86 that doesn't happen because x86
>> requires the store to be of equal or greater size than the load. So
>> instead the load takes the slow path, causing unacceptable slowdowns.
>>
>> [...]
>>
>> Clang used to generate narrower loads and stores for bit-fields, but a
>> long time ago it was intentionally changed to generate wider loads
>> and stores, IIRC by Chandler. There are some cases where I think the
>> “new” code goes overboard, but in this case I don’t particularly have
>> an issue with the wider loads and stores. I guess we could make a
>> best-effort attempt to stick to the storage-unit size when the
>> bit-fields break evenly on a boundary. But mostly I think the frontend’s
>> responsibility ends with it generating same-size accesses in both
>> places, and if inconsistent access sizes trigger poor performance,
>> the backend should be more careful about intentionally changing access
>> sizes.
>>
>> FWIW, when I was at Green Hills, I recall the rule being "Always use the
>> declared type of the bitfield to govern the size of the read or write."
>> (There was a similar rule for the meaning of `volatile`. I hope I'm not
>> just getting confused between the two. Actually, since of the compilers on
>> Godbolt, only MSVC follows this rule <https://godbolt.org/z/Aq_APH>, I'm
>> *probably* wrong.) That is, if the bitfield is declared `int16_t`, then
>> use 16-bit loads and stores for it; if it's declared `int32_t`, then use
>> 32-bit loads and stores.
>>
>> I’ve always liked MSVC’s bit-field rules as a coherent whole, but they are
>> quite different from the standard Unix rules. On Windows, T x : 3
>> literally allocates an entire T in the structure, and successive
>> bit-fields get packed into that T only if their base type is of the
>> same size (and they haven’t exhausted the original T). So of course
>> all accesses to that bit-field are basically of the full size of the T;
>> there’s no overlap to be concerned with. On Unix, bit-fields will typically
>> get packed together regardless of the base type; the base type does have
>> some influence, but it’s target-specific and somewhat odd.
>>
>> I’d prefer if we degraded to a Windows-like access behavior as much
>> as we can, but it’s not always possible because of that packing.
>>
>> John.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev