<div dir="ltr"><br><br><div class="gmail_quote"><div dir="ltr">On Thu, Jun 14, 2018 at 5:28 AM Tim Northover <<a href="mailto:t.p.northover@gmail.com" target="_blank">t.p.northover@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">>   * I'd like to see ARM+AArch64+Hexagon move away from the problematic<br>

>   expansion in IR and to have that code deleted from AtomicExpandPass. Are<br>

>   there any objections?<br>

<br>

I think it would be a great shame and I'd like to avoid it if at all<br>

possible, though I'm also uncomfortable with the current situation.<br>

<br>

The problem with late expansion is that even the simplest variants add<br>

some rather unpleasant pseudo-instructions, and I've never even seen<br>

anyone attempt to get good CodeGen out of it (simplest example being<br>

"add xD, xD, #1" in the loop for an increment but there are obviously<br>

more). Doing so would almost certainly involve duplicating a lot of<br>

basic arithmetic instructions into AtomicRMW variants.<br></blockquote><div><br></div><div>I think it would be potentially-useful, and difficult, to also add pseudos for the atomicrmw operations with an immediate operand of the range allowed by the corresponding arithmetic operation. That doesn't seem like it'd be difficult. But I rather doubt whether any further refinement beyond that would be useful in practice.</div><div><br></div><div>For example, sure -- ARM has shifted-add-register operations. And, an<span style="color:rgb(34,34,34);font-family:sans-serif;font-size:13px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"> "add r2, r2, r1, lsl #2" could potentially be<span style="color:rgb(34,34,34);font-family:sans-serif;font-size:13px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"> (but isn't currently!) generated inside the ll/sc loop for something like:</span></span></div><div><span style="font-family:monospace"><span style="color:rgb(0,0,0);background-color:rgb(255,255,255)">  %shl = shl i32 %incr, 2</span><br>  %0 = atomicrmw add i32* %i, i32 %shl seq_cst<br></span>...but the alternative is simply to generate an lsl instruction before the loop. I'm quite skeptical whether the ability to omit the 'lsl' would really be worth caring about. Plus we aren't even doing it now. :)</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Added to that is the fact that actual CPU implementations are often a<br>

lot less restrictive about what can go into a loop than is required<br>

(for example even spilling is fine on ours as long as the atomic<br>

object is not in the same cache-line; I suspect that's normal). That<br>

casts the issue in a distinctly theoretical light -- we've been doing<br>

this for years and as far as I'm aware nobody has ever hit the issue<br>

in the real world, or even had CodeGen go wrong in a way that *could*<br>

do so outside the -O0 situation</blockquote><div><br></div><div>So, there's two separate issues --</div><div>1. Deterministic failure of a loop even if it's the only thing running. For example, from register spilling to the same cacheline within the loop. As you note, that has been observed, but only at -O0.</div><div>2. Livelock between two such loops on different CPUs. AFAIK this has not been reported, but I expect it would be very difficult to observe, because one CPU will likely receive an OS interrupt eventually, and thus break out of the livelock situation. This would likely exhibit simply as worse performance than expected, which I expect would be very difficult to track down and report.</div><div><br></div><div>I'd suspect that generally, most of the violations of the architectural requirements would only cause #2, not #1. It seems entirely possible to me that this change may well be good both for correctness and performance-under-load.</div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">OTOH that *is* an argument for performance over correctness when you</blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

get right down to it, so I'm not sure I can be too forceful about it.<br>

At least not without a counter-proposal to restore guaranteed<br>

correctness.</blockquote><div><br></div><div>I'd rather look at it from the other direction -- does using a late-lowered atomicrmw pseudo _actually_ cause any observable performance degradation? I rather suspect it won't.</div><div><br></div><div>Now -- cmpxchg, as mentioned before, is trickier, and I could see that the inability to jump to a separate failure block may make a noticeable difference at the moment.</div><div><br></div></div></div>