<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;"><div>Hi,</div><div><br></div><div>I am investigating poor code generation on x86-64 involving a 64-bit structure with two 32-bit fields (float in the attached examples, but i32 exposes similar behavior, and this can probably be generalized to smaller types too).</div><div>The root cause of the problem is in SROA, although I am not sure we should fix anything there. That is why I need your advice.</div><div><br></div><div><br></div><div>** Problem **</div><div><br></div><div>64-bit structures are usually loaded as one chunk of bits, and the fields are then extracted from this chunk.</div><div>Although this may be generally better than loading each field on its own, it can lead to poor code generation when the operations extracting the fields are more expensive than a load or when fancy loads are available.</div><div><br></div><div>More generally, this may happen for smaller sizes too.</div><div><br></div><div><br></div><div>** Example **</div><div><br></div><div>1. %chunk64 = load i64</div><div>2. %field1trunced = trunc i64 %chunk64 to i32 // &lt; build field1 from chunk</div><div>3. %field1float = bitcast i32 %field1trunced to float // &lt; build field1 from chunk</div><div>4. %field2shifted = lshr i64 %chunk64, 32 // &lt; build field2 from chunk</div><div><div>5. %field2trunced = trunc i64 %field2shifted to i32 // &lt; build field2 from chunk</div></div><div>6. %field2 = bitcast i32 %field2trunced to float // &lt; build field2 from chunk</div><div><br></div><div>Scenario #1:</div><div>Floating-point registers are in another register bank, and register bank moves are almost as expensive as loads (instructions 3. and 6.).</div><div>Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat</div><div><br></div><div>Scenario #2:</div><div>Paired loads are available on the target. 
The truncate and shift instructions are unnecessary (instructions 2., 4., and 5.).</div><div>Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair</div><div><br></div><div><br></div><div>** To Reproduce **</div><div><br></div><div>Here is a way to reproduce the poor code generation for x86-64.</div><div><br></div><div>opt -sroa current_input.ll -S -o - | llc -O3 -o -</div><div><br></div><div>You will see 2 <span style="font-family: Menlo; font-size: 11px;">vmovd </span>and 1 <span style="font-family: Menlo; font-size: 11px;">shrq </span>that can be avoided, as illustrated by the next command<span style="font-family: Menlo; font-size: 11px;">.</span></div><div><br></div><div>Here is nicer code, produced by modifying the input so that SROA generates friendlier code for this case.</div><div><br></div><div><div>opt -sroa mod_input.ll -S -o - | llc -O3 -o -</div></div><div><br></div><div>Basically, the difference between the two inputs is that the memcpy has not been expanded in mod_input.ll (instcombine normally replaces it). 
Thus, SROA inserts its own loads to get rid of the memcpy instead of extracting the values from the 64-bit loads.</div><div><br></div><div><br></div><div>** Advice Required **</div><div><br></div><div>SROA generates this extract-fields-from-chunk-of-bits pattern.</div><div>However, as I said, I do not think this is generally a bad thing.</div><div><br></div><div>Would it make sense to rewrite the definitions of the involved slices so that SROA breaks them apart when they are loads (and under certain circumstances)?</div><div><br></div><div>More generally, do you think there is something we should do in SROA for this?</div><div><br></div><div>Currently, 32-bit targets (e.g., armv7s) do not suffer from this because type legalization in the selection DAG splits the 64-bit loads.</div><div><br></div><div>Should we do something similar for 64-bit targets with the proper target hooks?</div><div>If yes, what hooks?</div><div><br></div><div><br></div><div>Thanks for your help.</div><div><br></div><div>Cheers,</div><br><div apple-content-edited="true">
<div style="color: rgb(0, 0, 0); font-family: Helvetica; font-style: normal; font-variant: normal; font-weight: normal; letter-spacing: normal; line-height: normal; orphans: 2; text-align: -webkit-auto; text-indent: 0px; text-transform: none; white-space: normal; widows: 2; word-spacing: 0px; -webkit-text-size-adjust: auto; -webkit-text-stroke-width: 0px; word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; ">-Quentin</div>
</div>
</body></html>