[cfe-commits] PATCH: Large re-working of bitfield IR-gen, and a fix for PR13691

Chandler Carruth chandlerc at google.com
Tue Aug 28 17:33:19 PDT 2012


On Tue, Aug 28, 2012 at 5:02 PM, John McCall <rjmccall at apple.com> wrote:

> On Aug 28, 2012, at 4:34 PM, Chandler Carruth wrote:
>
> On Tue, Aug 28, 2012 at 3:52 PM, John McCall <rjmccall at apple.com> wrote:
>
>> On Aug 28, 2012, at 3:18 PM, Chandler Carruth wrote:
>> > I've added an explanatory comment. Hopefully won't get stale.
>> >
>> > Any other comments? I'd love to get this into the tree as we're seeing
>> actual races in the wild due to it....
>>
>> If you would like a fast review, perhaps a more minimal change would
>> have been in order. We're not going to rush into a total rewrite of
>> bitfield
>> IR-generation, especially not one that introduces a host of new
>> assumptions
>> and dependencies on a very general but poorly-specified IR feature that
>> has not previously been used in code with binary compatibility
>> requirements.
>>
>
> So, first off, sorry for being in a hurry. I'm not trying to get a rushed
> review, just hoping to get a review sooner rather than later. =D Anyways,
> I would *really* like thorough reviews of this. I'm very aware of and
> sensitive to the importance of this code. In fact, that's why I'm working
> on it.
>
> Also, I too would have preferred a more minimal change. None seemed
> readily available when Richard Smith and I were looking at this in detail.
>
> The primary idea we had was to bound the strided accesses to the size of
> the bitfield storage rather than just to the containing struct.
>
>
> Why not simply bound it to the offset of the next non-bitfield field (or,
> if there is none, the data size of the type)?
>

You also have to bound the start, not just the stop. That means either
issuing an unaligned load or peeling off the unaligned chunks. In my
testing this came up reasonably often, because the ABI layout frequently
does not require alignment even when the type would normally provide some.
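
To make that concrete, here is a hypothetical layout (the struct is made
up for illustration, assuming the usual Itanium/SysV rules on x86-64):

    struct Unaligned {
      char c;           // byte 0
      unsigned b : 16;  // bits 8-23: the bitfield storage begins at byte
                        // offset 1, which no 16-bit access is aligned to
    };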

> Is that challenging for some reason I'm not understanding?
>

No, it's not challenging. As I explained in my previous email, doing this
isn't hard, but in my testing it would very often result in byte-by-byte
loads and stores. There seemed to be good performance reasons to avoid
that (it allows more aggressive combining across the bitfield, etc.). What
would take quite a bit more code (not particularly challenging, just a lot
of code) is peeling off the unaligned small chunks at the front and the
back, and using wide loads and stores for the sections entirely within the
bitfield. That is *exactly* the code LLVM would already need in order to
lower these integers effectively, so it seemed preferable for the frontend
to emit a single, simple integer load and store, and to teach the backend
to handle it where necessary.
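
As a C-level sketch of the two strategies (purely illustrative; it reuses
the hypothetical Unaligned struct above and assumes a little-endian
target):

    #include <cstdint>
    #include <cstring>

    // The approach in this patch: one integer access covering the whole
    // bitfield storage unit. If the target cannot do unaligned 16-bit
    // loads, the backend legalizes this into whatever it can do.
    uint16_t load_b_wide(const Unaligned *s) {
      uint16_t bits;
      std::memcpy(&bits, reinterpret_cast<const char *>(s) + 1,
                  sizeof(bits));
      return bits;
    }

    // The alternative: peel the access into aligned byte-sized chunks in
    // the frontend. More code, and it pessimizes targets that have fast
    // unaligned loads.
    uint16_t load_b_peeled(const Unaligned *s) {
      const unsigned char *p = reinterpret_cast<const unsigned char *>(s);
      return static_cast<uint16_t>(p[1] | (p[2] << 8));
    }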


> I also happened to be hacking on SROA recently, and that pass was in fact
> forming IR structures similar to the ones you're worried about:
> arbitrarily sized integers with bit insertion and extraction as a
> "vector" of "bits". I'm not
> claiming that it produced the exact same structures or that there aren't
> bugs here. I would be more than happy to provide extra testing and try to
> flush out any crufty bits of LLVM here. I'm just saying I don't think this
> pattern is *completely* unused...
>
> Anyways, does even the vague approach seem good? Should I work on a
> different strategy? Are there particular tests that I might be able to run?
> Any of this would be helpful feedback for me here.
>
>
> I asked Daniel to make a detailed review, since he wrote the current
> implementation and did a lot of investigation for it.
>
> The things that worry me most are:
>
> (1) How well-specified and -tested are the paths for non-power-of-2
> integer types in LLVM?  Do we actually promise that, say, storing to an i17
> will not touch the fourth byte?  Do we provide reasonably portable behavior
> across targets, or is this an instance of the x86 backend having received a
> lot of attention and everything else being a festering morass of latent
> issues?  I know that SROA's behavior has exposed bugs before (IIRC with the
> occasional response of making SROA less aggressive), not least because
> SROA's behavior doesn't really kick in very often.
>

Sure, this is a great point. I'm happy to do any testing I can here. One
thing I think I've already done is ensure that we only produce i8
multiples, not i17s and such. I'll double-check, but I would not expect to
see any of those. My hope is that this would in the worst case be
trivially legalized to a sequence of i8 operations, and in the best case
become the optimal pattern.
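
For instance (a hypothetical layout, assuming the patch behaves as
intended):

    struct Wide {
      unsigned a : 17;  // 17 bits of data, but the storage unit rounds
                        // up to a byte multiple: accesses come out as
                        // i24 loads and stores, never as i17
    };
    // Worst case, the backend legalizes the i24 into i16+i8 (or three
    // i8) operations; it never has to reason about sub-byte widths.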

> (2) How limiting is this going to be for bitfields on "weird" platforms?
>

I would not expect this to be a problem, but perhaps I don't know the
platforms where it will crop up. The intent is to treat this as a literal
bag of bits. Any platform where that works should be well supported.


>  Do we get anything useful from this change on big-endian targets?
>

I think so. The code is much less confusing to me now. That is, if I
understood the original code correctly. I'm not at all sure we have enough
test coverage of big-endian IR generation to catch bugs, and I know that
I'm not an expert on their ABI requirements. Maybe someone can tell me what
to test for or look for in the output?

> Will we find ourselves completely reimplementing it to support MS-style
> packed bitfields?
>

I don't know anything about MS-style packed bitfields, but I'll go
digging....

> When I surveyed a few other people for their first impressions (including
> Chris and Daniel), there was also a lot of concern that the backend will
> often not be able to widen these loads and stores because the high-level
> information simply isn't there.
>

Correct. The point is to *not* widen them. ;]

From a memory model perspective, the frontend should emit the widest
possible loads and stores. That tells the backend it has exactly that much
latitude. This change actually helps this in several cases by emitting
*wider* loads and stores and thus clarifying that there cannot be a race
across that memory.
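
Concretely, this is the sort of race I mean (a sketch; the struct is made
up):

    struct Racy {
      unsigned bits : 24;  // under the patch: a 3-byte (i24) storage unit
      char c;              // byte 3; may be written by another thread
    };
    // A store to Racy::bits must not read-modify-write the byte holding
    // Racy::c. Emitting an i24 store (rather than a 4-byte
    // load/modify/store cycle) tells the backend explicitly that c's
    // byte is untouched.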

Hopefully, the optimizer is always going to widen the *load* side of this
-- certainly there are some such optimizations already, but maybe we need
to add more.

There is still performance work to be done here, though, as I mentioned in
the original email. The frontend should further widen stores to include
any padding when that padding is safe to read and write. This should, for
example, allow the common pattern of using an i32 store rather than an i24
store for a bitfield of between 17 and 24 bits. I'm very happy to add this
optimization in the frontend when it is demonstrably safe; I didn't
include it in the original patch because... well, as you point out, the
patch is already huge! =]
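
For example (hypothetical, assuming a typical x86-64 layout):

    struct Padded {
      unsigned bits : 20;  // stored via an i24 today (bytes 0-2)
    };
    // sizeof(Padded) == 4: byte 3 is padding that no other object can
    // alias, so the frontend could safely widen the i24 store to a
    // single i32 store -- exactly the follow-up optimization described
    // above.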

However, it is a fundamental aspect of the C++ memory model (or LLVM's,
which we wrote to support those of C++ and Java, among others) that store
widening isn't generally feasible, for the reasons you mention. (In the
cases where it is feasible, we can usually form an SSA value, obviating
the entire question.)