[LLVMdev] [RFC] Poor code generation for paired load

Mon Aug 12 09:59:27 PDT 2013

Hi Eli,

Thanks for the feedbacks.

On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com> wrote:

> On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi,
>> 
>> I am investigating a poor code generation on x86-64 involving a 64-bits
>> structure with two 32-bits fields (in the attached examples float, but
>> similar behavior is exposed with i32, and we can probably generalize that to
>> smaller types too).
>> The root cause of the problem is in SROA, although I am not sure we should
>> fix something there. That is why I need your advices.
>> 
>> 
>> ** Problem **
>> 
>> 64-bits structures are usually loaded as one chunk of bits and fields are
>> extracted from this chunk.
>> Although this may be generally better than loading each field on its own,
>> this can lead to poor code generation when the operations extracting the
>> fields are more expensive than a load or when “fancy” loads are available.
>> 
>> More generally, this may happen for smaller size too.
>> 
>> 
>> ** Example **
>> 
>> 1. %chunk64 = load i64
>> 2. %field1trunced = trunc i64 %chunk64 to i32          // <— build field1
>> from chunk
>> 3. %field1float = bitcast i32 field1trunced to float       // <— build
>> field1 from chunk
>> 4. %field2shifted = lshr i64 %chunk64, 32                 // <— build field2
>> from chunk
>> 5. %field2trunced = trunc i64 %field2shifter to i32     // <— build field2
>> from chunk
>> 6. %field2 = bitcast i32 %field2trunced to float           // <— build
>> field2 from chunk
>> 
>> Scenario #1:
>> Floating point registers are on another register bank and register bank
>> moves are almost as expensive as loads (instructions 3. and 6.).
>> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
>> 
>> Scenario #2
>> Paired loads are available on the target. Truncate and shift instructions
>> are useless (instructions 2., 4., and 5.).
>> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>> 
>> 
>> ** To Reproduce **
>> 
>> Here is a way to reproduce the poor code generation for x86-64.
>> 
>> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>> 
>> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
>> next command.
>> 
>> Here is a nicer code produced by modifying the input so that SROA generates
>> friendlier code for this case.
>> 
>> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>> 
>> Basically the difference between both inputs is that memcpy has not been
>> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
>> inserts its own loads to get rid of the memcpy instead of extracting the
>> values from the 64-bits loads.
>> 
>> 
>> ** Advices Required **
>> 
>> SROA generates this extract-fields-from-chunk-of-bits thing.
>> However, like I said, I do not think this is generally a bad thing.
>> 
>> Would it make sense to rewrite the definitions of the involved slices so
>> that SROA breaks them apart when they are loads (and under certain
>> circumstance)?
>> 
>> More generally, do you think there is something we should do in SROA for
>> this?
> 
> The load that you want to split is actually the i64 load from memory
> with SROA knows nothing about.  So the transform you want isn't really
> related to what SROA does.
Agreed.

> 
>> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
>> legalization of types in selection DAG split the 64-bits loads.
>> 
>> Should we do something similar for 64-bits targets with the proper target
>> hooks?
>> If yes, what hooks?
> 
> Hmm... detecting the pattern in IR isn't particularly hard; see
> InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
> detects a similar sort of pattern.  You might want to consider adding
> something in instcombine.
Thanks for the direction, I will have a look.

> 
> I'm trying to think if there's some scenario where you wouldn't want
> to rewrite the load into two loads, but I'm having trouble coming up
> with anything: two scalar loads at a constant offset from each other
> are pretty easy to detect for the sorts of passes that like to mess
> with loads.  So we probably just want to declare that two load
> instructions is the canonical form for loading two floats which are
> next to each other in memory, and do this transform on IR for all
> targets.
It is more general than floating point types. For instance, if you replace float by i32 (and fadd by add, of course :)) in the example I had in my initial email, the produced code is also better for x86-64 when you split the loads.
Thus, do you think we should split all loads and rely on later passes to combine them, if needed for the current target?

Thanks,

-Quentin
> 
> -Eli

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130812/1c3c851a/attachment.html>