[LLVMdev] [RFC] Poor code generation for paired load

Eli Friedman eli.friedman at gmail.com
Fri Aug 9 20:00:55 PDT 2013


On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com> wrote:
> Hi,
>
> I am investigating a poor code generation on x86-64 involving a 64-bits
> structure with two 32-bits fields (in the attached examples float, but
> similar behavior is exposed with i32, and we can probably generalize that to
> smaller types too).
> The root cause of the problem is in SROA, although I am not sure we should
> fix something there. That is why I need your advices.
>
>
> ** Problem **
>
> 64-bits structures are usually loaded as one chunk of bits and fields are
> extracted from this chunk.
> Although this may be generally better than loading each field on its own,
> this can lead to poor code generation when the operations extracting the
> fields are more expensive than a load or when “fancy” loads are available.
>
> More generally, this may happen for smaller size too.
>
>
> ** Example **
>
> 1. %chunk64 = load i64
> 2. %field1trunced = trunc i64 %chunk64 to i32          // <— build field1
> from chunk
> 3. %field1float = bitcast i32 field1trunced to float       // <— build
> field1 from chunk
> 4. %field2shifted = lshr i64 %chunk64, 32                 // <— build field2
> from chunk
> 5. %field2trunced = trunc i64 %field2shifter to i32     // <— build field2
> from chunk
> 6. %field2 = bitcast i32 %field2trunced to float           // <— build
> field2 from chunk
>
> Scenario #1:
> Floating point registers are on another register bank and register bank
> moves are almost as expensive as loads (instructions 3. and 6.).
> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
>
> Scenario #2
> Paired loads are available on the target. Truncate and shift instructions
> are useless (instructions 2., 4., and 5.).
> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>
>
> ** To Reproduce **
>
> Here is a way to reproduce the poor code generation for x86-64.
>
> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>
> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
> next command.
>
> Here is a nicer code produced by modifying the input so that SROA generates
> friendlier code for this case.
>
> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>
> Basically the difference between both inputs is that memcpy has not been
> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
> inserts its own loads to get rid of the memcpy instead of extracting the
> values from the 64-bits loads.
>
>
> ** Advices Required **
>
> SROA generates this extract-fields-from-chunk-of-bits thing.
> However, like I said, I do not think this is generally a bad thing.
>
> Would it make sense to rewrite the definitions of the involved slices so
> that SROA breaks them apart when they are loads (and under certain
> circumstance)?
>
> More generally, do you think there is something we should do in SROA for
> this?

The load that you want to split is actually the i64 load from memory
with SROA knows nothing about.  So the transform you want isn't really
related to what SROA does.

> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
> legalization of types in selection DAG split the 64-bits loads.
>
> Should we do something similar for 64-bits targets with the proper target
> hooks?
> If yes, what hooks?

Hmm... detecting the pattern in IR isn't particularly hard; see
InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
detects a similar sort of pattern.  You might want to consider adding
something in instcombine.

I'm trying to think if there's some scenario where you wouldn't want
to rewrite the load into two loads, but I'm having trouble coming up
with anything: two scalar loads at a constant offset from each other
are pretty easy to detect for the sorts of passes that like to mess
with loads.  So we probably just want to declare that two load
instructions is the canonical form for loading two floats which are
next to each other in memory, and do this transform on IR for all
targets.

-Eli




More information about the llvm-dev mailing list