<div class="gmail_extra"><div class="gmail_quote">On Fri, Aug 31, 2012 at 3:35 PM, Daniel Dunbar <span dir="ltr"><<a href="mailto:daniel@zuster.org" target="_blank" class="cremed">daniel@zuster.org</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div class="HOEnZb"><div class="h5">On Tue, Aug 28, 2012 at 5:33 PM, Chandler Carruth <<a href="mailto:chandlerc@google.com" class="cremed">chandlerc@google.com</a>> wrote:<br>


> On Tue, Aug 28, 2012 at 5:02 PM, John McCall <<a href="mailto:rjmccall@apple.com" class="cremed">rjmccall@apple.com</a>> wrote:<br>

>><br>

>> On Aug 28, 2012, at 4:34 PM, Chandler Carruth wrote:<br>

>><br>

>> On Tue, Aug 28, 2012 at 3:52 PM, John McCall <<a href="mailto:rjmccall@apple.com" class="cremed">rjmccall@apple.com</a>> wrote:<br>

>>><br>

>>> On Aug 28, 2012, at 3:18 PM, Chandler Carruth wrote:<br>

>>> > I've added an explanatory comment. Hopefully won't get stale.<br>

>>> ><br>

>>> > Any other comments? I'd love to get this into the tree as we're seeing<br>

>>> > actual races in the wild due to it....<br>

>>><br>

>>> If you would like a fast review, perhaps a more minimal change would<br>

>>> have been in order. We're not going to rush into a total rewrite of<br>

>>> bitfield<br>

>>> IR-generation, especially not one that introduces a host of new<br>

>>> assumptions<br>

>>> and dependencies on a very general but poorly-specified IR feature that<br>

>>> has not previously been used in code with binary compatibility<br>

>>> requirements.<br>

>><br>

>><br>

>> So, first off, sorry for being in a hurry. I'm not trying to get a rushed<br>

>> review, just hoping to get review sooner rather than later. =D Anyways, I<br>

>> would *really* like thorough reviews of this. I'm very aware and sensitive<br>

>> to the importance of this code. In fact, that's why I'm working on it.<br>

>><br>

>> Also, I too would have preferred a more minimal change. None seemed<br>

>> readily available when both I and Richard Smith were looking at this in<br>

>> detail.<br>

>><br>

>> The primary idea we had was to bound the strided accesses to the size of<br>

>> the bitfield storage rather than just to the containing struct.<br>

>><br>

>><br>

>> Why not simply bound it to the offset of the next non-bitfield field (or,<br>

>> if there is none, the data size of the type)?<br>

><br>

><br>

> You also have to bound the start, not just the stop. That means reasonably<br>

> often either issuing an unaligned load or peeling off the unaligned chunks.<br>

> In my testing, this happened reasonably often due to the ABI layout not<br>

> requiring alignment in many cases, even when the type would typically<br>

> provide some.<br>

><br>

>>  Is that challenging for some reason I'm not understanding?<br>

><br>

><br>

> No, it's not challenging. As I explained in my previous email, doing this<br>

> isn't hard, but in my testing it would very often result in byte-by-byte<br>

> loads and stores. There seemed to be very good reason to not want that for<br>

> performance reasons (allowing more aggressive combining across the bitfield,<br>

> etc). What would be quite a bit more code (not particularly challenging,<br>

> just a lot of code) would be peeling off the unaligned small chunks on the<br>

> front and the back, and using wide loads and stores for sections entirely<br>

> within the bitfield. It seemed like the *exact* code we would already need<br>

> in LLVM to lower these integers effectively, and so it seemed preferably for<br>

> the frontend to emit a simple single integer load and store, and teach the<br>

> backend where necessary to handle it.<br>

><br>

>><br>

>> I also happened to be hacking on SROA recently, and that pass was in fact<br>

>> forming similar IR structures you're worried about: arbitrarily sized<br>

>> integers with bit insertion and extraction as a "vector" of "bits". I'm not<br>

>> claiming that it produced the exact same structures or that there aren't<br>

>> bugs here. I would be more than happy to provide extra testing and try to<br>

>> flush out any crufty bits of LLVM here. I'm just saying I don't think this<br>

>> pattern is *completely* unused...<br>

>><br>

>> Anyways, does even the vague approach seem good? Should I work on a<br>

>> different strategy? Are there particular tests that I might be able to run?<br>

>> Any of this would be helpful feedback for me here.<br>

>><br>

>><br>

>> I asked Daniel to make a detailed review, since he wrote the current<br>

>> implementation and did a lot of investigation for it.<br>

>><br>

>> The things that worry me most are:<br>

>><br>

>> (1) How well-specified and -tested are the paths for non-power-of-2<br>

>> integer types in LLVM?  Do we actually promise that, say, storing to an i17<br>

>> will not touch the fourth byte?  Do we provide reasonably portable behavior<br>

>> across targets, or is this an instance of the x86 backend having received a<br>

>> lot of attention and everything else being a festering morass of latent<br>

>> issues?  I know that SROA's behavior has exposed bugs before (IIRC with the<br>

>> occasional response of making SROA less aggressive), not least of which<br>

>> because SROA's behavior doesn't really kick in a lot.<br>

><br>

><br>

> Sure, this is a great point. I'm happy to do any testing I can here. One<br>

> thing that I think I've already done is ensure that we only produce<br>

> i8-multiples, not i17s and such. I'll double check, but I would not expect<br>

> to see any of those. My hope was that this would in the worst cased be<br>

> trivially legalized to a sequence of i8 operations, and in the best case,<br>

> the optimal pattern.<br>

><br>

>> (2) How limiting is this going to be for bitfields on "weird" platforms?<br>

><br>

><br>

> I would not expect this to be a problem, but perhaps I don't know the<br>

> platforms where it will crop up. The intent is to treat this as a literal<br>

> bag-of-bits. Any platform wher ethat works should be well supported.<br>

><br>

>><br>

>>  Do we get anything useful from this change on big-endian targets?<br>

><br>

><br>

> I think so. The code is much less confusing to me now. That is, if I<br>

> understood the original code correctly. I'm not at all sure we have enough<br>

> test coverage of big-endian IR generation to catch bugs, and I know that I'm<br>

> not an expert on their ABI requirements. Maybe someone can tell me what to<br>

> test for or look for in the output?<br>

><br>

>> Will we find ourselves completely reimplementing it to support MS-style<br>

>> packed bitfields?<br>

><br>

><br>

> I don't know anything about MS-style packed bitfields, but I'll go<br>

> digging....<br>

><br>

>> When I surveyed a few other people for their first impressions (including<br>

>> Chris and Daniel), there was also a lot of concern that the backend will<br>

>> often not be able to widen these loads and stores because the high-level<br>

>> information simply isn't there.<br>

><br>

><br>

> Correct. The point is to *not* widen them. ;]<br>

><br>

> From a memory model perspective, the frontend should emit the widest<br>

> possible loads and stores. That tells the backend it has exactly that much<br>

> latitude. This change actually helps this in several cases by emitting<br>

> *wider* loads and stores and thus clarifying that there cannot be a race<br>

> across that memory.<br>

<br>

</div></div>We also want to know the optimizer will narrow these when necessary<br>

(current code also has that problem, but to a less severe degree). If<br>

you are interested, here is a test case we don't currently<br>

subsequently narrow:<br>

--<br>

#pragma pack(1)<br>

struct s0 {<br>

  unsigned a : 9;<br>

  unsigned b0 : 5;<br>

  unsigned b1 : 5;<br>

  unsigned b2 : 5;<br>

  unsigned b3 : 5;<br>

  unsigned b4 : 5;<br>

  unsigned b5 : 5;<br>

  unsigned b6 : 5;<br>

  unsigned b7 : 5;<br>

  unsigned b8 : 5;<br>

  unsigned b9 : 5;<br>

  unsigned b10 : 5;<br>

  unsigned b11 : 5;<br>

  char c;<br>

};<br>

<br>

unsigned f0(struct s0 *p) {<br>

  return p->b8;<br>

}<br>

--<br>

which leaves us with a 64-bit load.</blockquote><div><br></div><div>Yea, this ends up with the same 64-bit load. Is that really bad? I suppose it might be nice to teach the backend to narrow to a single byte load where we can if only for the encoding shrink.</div>

<div><br></div><div>I'll add this as a test case if you like, but it'll have to be a split test case, one in clang to produce the giant integer operation, and one in llvm to test how it lowers it. </div></div></div>