[LLVMdev] [RFC] Poor code generation for paired load

Quentin Colombet qcolombet at apple.com
Mon Aug 12 11:44:52 PDT 2013


On Aug 12, 2013, at 11:40 AM, Eli Friedman <eli.friedman at gmail.com> wrote:

> On Mon, Aug 12, 2013 at 9:59 AM, Quentin Colombet <qcolombet at apple.com> wrote:
>> Hi Eli,
>> 
>> Thanks for the feedbacks.
>> 
>> On Aug 9, 2013, at 8:00 PM, Eli Friedman <eli.friedman at gmail.com> wrote:
>> 
>> On Fri, Aug 9, 2013 at 4:58 PM, Quentin Colombet <qcolombet at apple.com>
>> wrote:
>> 
>> Hi,
>> 
>> I am investigating a poor code generation on x86-64 involving a 64-bits
>> structure with two 32-bits fields (in the attached examples float, but
>> similar behavior is exposed with i32, and we can probably generalize that to
>> smaller types too).
>> The root cause of the problem is in SROA, although I am not sure we should
>> fix something there. That is why I need your advices.
>> 
>> 
>> ** Problem **
>> 
>> 64-bits structures are usually loaded as one chunk of bits and fields are
>> extracted from this chunk.
>> Although this may be generally better than loading each field on its own,
>> this can lead to poor code generation when the operations extracting the
>> fields are more expensive than a load or when “fancy” loads are available.
>> 
>> More generally, this may happen for smaller size too.
>> 
>> 
>> ** Example **
>> 
>> 1. %chunk64 = load i64
>> 2. %field1trunced = trunc i64 %chunk64 to i32          // <— build field1
>> from chunk
>> 3. %field1float = bitcast i32 field1trunced to float       // <— build
>> field1 from chunk
>> 4. %field2shifted = lshr i64 %chunk64, 32                 // <— build field2
>> from chunk
>> 5. %field2trunced = trunc i64 %field2shifter to i32     // <— build field2
>> from chunk
>> 6. %field2 = bitcast i32 %field2trunced to float           // <— build
>> field2 from chunk
>> 
>> Scenario #1:
>> Floating point registers are on another register bank and register bank
>> moves are almost as expensive as loads (instructions 3. and 6.).
>> Cost: ldi64 + 2 int_to_fp vs. 2 ldfloat
>> 
>> Scenario #2
>> Paired loads are available on the target. Truncate and shift instructions
>> are useless (instructions 2., 4., and 5.).
>> Cost: ldi64 + 2 trunc + 1 shift vs. 1 ldpair
>> 
>> 
>> ** To Reproduce **
>> 
>> Here is a way to reproduce the poor code generation for x86-64.
>> 
>> opt -sroa current_input.ll -S -o - | llc -O3 -o -
>> 
>> You will see 2 vmovd and 1 shrq that can be avoided as illustrated with the
>> next command.
>> 
>> Here is a nicer code produced by modifying the input so that SROA generates
>> friendlier code for this case.
>> 
>> opt -sroa mod_input.ll -S -o - | llc -O3 -o -
>> 
>> Basically the difference between both inputs is that memcpy has not been
>> expanded in mod_input.ll (instcombine normally replaces it). Thus, SROA
>> inserts its own loads to get rid of the memcpy instead of extracting the
>> values from the 64-bits loads.
>> 
>> 
>> ** Advices Required **
>> 
>> SROA generates this extract-fields-from-chunk-of-bits thing.
>> However, like I said, I do not think this is generally a bad thing.
>> 
>> Would it make sense to rewrite the definitions of the involved slices so
>> that SROA breaks them apart when they are loads (and under certain
>> circumstance)?
>> 
>> More generally, do you think there is something we should do in SROA for
>> this?
>> 
>> 
>> The load that you want to split is actually the i64 load from memory
>> with SROA knows nothing about.  So the transform you want isn't really
>> related to what SROA does.
>> 
>> Agreed.
>> 
>> 
>> Currently, 32-bits targets (e.g., armv7s) do not suffer this because the
>> legalization of types in selection DAG split the 64-bits loads.
>> 
>> Should we do something similar for 64-bits targets with the proper target
>> hooks?
>> If yes, what hooks?
>> 
>> 
>> Hmm... detecting the pattern in IR isn't particularly hard; see
>> InstCombiner::SliceUpIllegalIntegerPHI for an example of code which
>> detects a similar sort of pattern.  You might want to consider adding
>> something in instcombine.
>> 
>> Thanks for the direction, I will have a look.
>> 
>> 
>> I'm trying to think if there's some scenario where you wouldn't want
>> to rewrite the load into two loads, but I'm having trouble coming up
>> with anything: two scalar loads at a constant offset from each other
>> are pretty easy to detect for the sorts of passes that like to mess
>> with loads.  So we probably just want to declare that two load
>> instructions is the canonical form for loading two floats which are
>> next to each other in memory, and do this transform on IR for all
>> targets.
>> 
>> It is more general than floating point types. For instance, if you replace
>> float by i32 (and fadd by add, of course :)) in the example I had in my
>> initial email, the produced code is also better for x86-64 when you split
>> the loads.
>> Thus, do you think we should split all loads and rely on later passes to
>> combine them, if needed for the current target?
> 
> Right... replace "float" with "arithmetic type".
> 
> You probably want to be careful you don't end up generating
> overlapping loads, and larger loads are generally better if we're not
> doing arithmetic, but otherwise this seems like a good idea.  That
> said, I'd suggest writing it up and starting a new thread on llvmddv.

Thanks Eli.

Will do.

-Quentin

> 
> -Eli

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20130812/badc2834/attachment.html>


More information about the llvm-dev mailing list