<html><head><meta http-equiv="Content-Type" content="text/html charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><br><div><div>On Aug 12, 2013, at 11:40 AM, Eli Friedman <<a href="mailto:eli.friedman@gmail.com">eli.friedman@gmail.com</a>> wrote:</div><br class="Apple-interchange-newline"><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;">On Mon, Aug 12, 2013 at 9:59 AM, Quentin Colombet <<a href="mailto:qcolombet@apple.com">qcolombet@apple.com</a>> wrote:<br><blockquote type="cite">Hi Eli,<br><br>Thanks for the feedbacks.<br><br>On Aug 9, 2013, at 8:00 PM, Eli Friedman <<a href="mailto:eli.friedman@gmail.com">eli.friedman@gmail.com</a>> wrote:<br><br>On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <<a href="mailto:qcolombet@apple.com">qcolombet@apple.com</a>><br>wrote:<br><br>Hi,<br><br>I am investigating a poor code generation on x86-64 involving a 64-bits<br>structure with two 32-bits fields (in the attached examples float, but<br>similar behavior is exposed with i32, and we can probably generalize that to<br>smaller types too).<br>The root cause of the problem is in SROA, although I am not sure we should<br>fix something there. That is why I need your advices.<br><br><br>** Problem **<br><br>64-bits structures are usually loaded as one chunk of bits and fields are<br>extracted from this chunk.<br>Although this may be generally better than loading each field on its own,<br>this can lead to poor code generation when the operations extracting the<br>fields are more expensive than a load or when fancy loads are available.<br><br>More generally, this may happen for smaller size too.<br><br><br>** Example **<br><br>1. %chunk64 = load i64<br>2. %field1trunced = trunc i64 %chunk64 to i32 // < build field1<br>from chunk<br>3. %field1float = bitcast i32 field1trunced to float // < build<br>field1 from chunk<br>4. %field2shifted = lshr i64 %chunk64, 32 // < build field2<br>from chunk<br>5. %field2trunced = trunc i64 %field2shifter to i32 // < build field2<br>from chunk<br>6. %field2 = bitcast i32 %field2trunced to float // < build<br>field2 from chunk<br><br>Scenario #1:<br>Floating point registers are on another register bank and register bank<br>moves are almost as expensive as loads (instructions 3. and 6.).<br>Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat<br><br>Scenario #2<br>Paired loads are available on the target. Truncate and shift instructions<br>are useless (instructions 2., 4., and 5.).<br>Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair<br><br><br>** To Reproduce **<br><br>Here is a way to reproduce the poor code generation for x86-64.<br><br>opt -sroa current_input.ll -S -o - | llc -O3 -o -<br><br>You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the<br>next command.<br><br>Here is a nicer code produced by modifying the input so that SROA generates<br>friendlier code for this case.<br><br>opt -sroa mod_input.ll -S -o - | llc -O3 -o -<br><br>Basically the difference between both inputs is that memcpy has not been<br>expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA<br>inserts its own loads to get rid of the memcpy instead of extracting the<br>values from the 64-bits loads.<br><br><br>** Advices Required **<br><br>SROA generates this extract-fields-from-chunk-of-bits thing.<br>However, like I said, I do not think this is generally a bad thing.<br><br>Would it make sense to rewrite the definitions of the involved slices so<br>that SROA breaks them apart when they are loads (and under certain<br>circumstance)?<br><br>More generally, do you think there is something we should do in SROA for<br>this?<br><br><br>The load that you want to split is actually the i64 load from memory<br>with SROA knows nothing about. So the transform you want isn't really<br>related to what SROA does.<br><br>Agreed.<br><br><br>Currently, 32-bits targets (e.g., armv7s) do not suffer this because the<br>legalization of types in selection DAG split the 64-bits loads.<br><br>Should we do something similar for 64-bits targets with the proper target<br>hooks?<br>If yes, what hooks?<br><br><br>Hmm... detecting the pattern in IR isn't particularly hard; see<br>InstCombiner::SliceUpIllegalIntegerPHI for an example of code which<br>detects a similar sort of pattern. You might want to consider adding<br>something in instcombine.<br><br>Thanks for the direction, I will have a look.<br><br><br>I'm trying to think if there's some scenario where you wouldn't want<br>to rewrite the load into two loads, but I'm having trouble coming up<br>with anything: two scalar loads at a constant offset from each other<br>are pretty easy to detect for the sorts of passes that like to mess<br>with loads. So we probably just want to declare that two load<br>instructions is the canonical form for loading two floats which are<br>next to each other in memory, and do this transform on IR for all<br>targets.<br><br>It is more general than floating point types. For instance, if you replace<br>float by i32 (and fadd by add, of course :)) in the example I had in my<br>initial email, the produced code is also better for x86-64 when you split<br>the loads.<br>Thus, do you think we should split all loads and rely on later passes to<br>combine them, if needed for the current target?<br></blockquote><br>Right... replace "float" with "arithmetic type".<br><br>You probably want to be careful you don't end up generating<br>overlapping loads, and larger loads are generally better if we're not<br>doing arithmetic, but otherwise this seems like a good idea. That<br>said, I'd suggest writing it up and starting a new thread on llvmddv.<br></div></blockquote><div><br></div><div>Thanks Eli.</div><div><br></div><div>Will do.</div><div><br></div><div>-Quentin</div><br><blockquote type="cite"><div style="font-size: 12px; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: auto; word-spacing: 0px; -webkit-text-stroke-width: 0px;"><br>-Eli</div></blockquote></div><br></body></html>