[LLVMdev] [RFC] LegalizeDAG support for targets without subword load/store instructions

Sat Jul 16 16:09:32 PDT 2011

On 07/16/2011 04:01 PM, Richard Osborne wrote:
> On 16 Jul 2011, at 03:34, Matt Johnson wrote:
>
>> Hi All,
>>      Some targets don't provide subword (e.g., i8 and i16 for a 32-bit
>> machine) load and store instructions, so currently we have to
>> custom-lower Load- and StoreSDNodes in our backends.  For examples, see
>> LowerLOAD() and LowerSTORE() in {XCore,CellSPU}ISelLowering.cpp.  I
>> believe it's possible to support this lowering in a target-agnostic
>> fashion in LegalizeDAG.cpp, similar to what is done for
>> non-naturally-aligned loads and stores using the
>> allowsUnalignedMemoryAccesses() target hook.
> The XCore does support i8 and i16 loads and stores. As far as I can remember the standard lowering produced functionally correct code for us. We custom lower misaligned loads and stores because we want to produce code that is better optimized for our target.

Thanks for the clarification!  I didn't do enough homework on the XCore 
backend; I initially patterned unaligned support in my own backend after 
the stuff in CellSPU (which *is* for correctness) and grepped around for 
any other backends with similar constructs as I wrote my previous post.

> In particular, if an i32 load is from an address known to be a constant offset away from being word aligned, it is quicker to load the two 32-bit values at aligned addresses which overlap the data and then shift and OR these values to form the result.

Smart; this allows you to elide some ADD and AND instructions that you'd 
need if you didn't know where the i32 fell w.r.t. word boundaries.
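
As a rough illustration of that trick (a sketch only, not the XCore
backend's code): it assumes a little-endian machine with 32-bit words,
and the function names are hypothetical.  With a known offset the shift
amounts are compile-time constants; with an unknown offset they have to
be computed from the address first.

```cpp
#include <cassert>
#include <cstdint>

// Offset known at compile time: two aligned loads, two constant shifts, one OR.
// `alignedWords` points at the aligned word containing the first byte.
template <unsigned ByteOff>
uint32_t loadI32AtKnownOffset(const uint32_t *alignedWords) {
  static_assert(ByteOff >= 1 && ByteOff <= 3, "offset 0 is a plain aligned load");
  constexpr unsigned Shift = ByteOff * 8;
  uint32_t lo = alignedWords[0];
  uint32_t hi = alignedWords[1];
  return (lo >> Shift) | (hi << (32 - Shift));
}

// Offset unknown: the word index and shift amount must be derived from the
// address at run time, which costs the extra address arithmetic.
inline uint32_t loadI32UnknownOffset(const uint32_t *mem, uint32_t byteAddr) {
  assert(mem != nullptr);
  uint32_t idx = byteAddr >> 2;
  uint32_t shift = (byteAddr & 3u) * 8;
  uint32_t lo = mem[idx];
  if (shift == 0)
    return lo;  // aligned after all; also avoids an undefined shift by 32
  uint32_t hi = mem[idx + 1];
  return (lo >> shift) | (hi << (32 - shift));
}
```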

> Also i32 loads / stores not known to be 32-bit or 16-bit aligned are expanded to a call to a library function. This can be a big code size win as these operations would otherwise expand to a significant number of instructions.

Also very smart; my initial sketch results in 26 instructions for a 
worst-case i32 store.  Processors that omit subword ops also tend to 
have small i-caches, so the library function seems preferable.
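
To give a feel for where the instructions go, here is a sketch of the
simplest case, a single i8 store on a machine with only aligned 32-bit
loads and stores.  It assumes little-endian byte order, and the helper
name is hypothetical.  It is not the sequence behind the count above; a
misaligned i32 store has to do this kind of masking across two words.

```cpp
#include <cassert>
#include <cstdint>

// Store one byte using only whole-word accesses: read the containing word,
// clear the target byte lane, merge in the new byte, write the word back.
inline void storeI8ViaWord(uint32_t *mem, uint32_t byteAddr, uint8_t value) {
  assert(mem != nullptr);
  uint32_t *word = &mem[byteAddr >> 2];        // containing aligned word
  unsigned shift = (byteAddr & 3u) * 8;        // bit position of the byte lane
  uint32_t mask = uint32_t(0xFFu) << shift;    // selects the byte lane
  *word = (*word & ~mask) | (uint32_t(value) << shift);
}
```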

> I'm not sure how this fits in with the changes you want to make. It does sound like the kind of thing that would be good to add to the target independent lowering code, but I suspect it won't help the XCore backend.

To oversimplify a bit, what I'd like to do is support a superset of what 
XCore does (I would say "CellSPU and XCore", but realistically I think 
just tackling scalar types would be a good first step, and CellSPU is 
vector-centric), and allow the Target to tune the codegen behavior for 
certain types, certain alignments, certain subtargets, etc. to get the 
best performance.  I agree that the benefit to XCore would probably just 
be that the existing lowering code would go away, unless we can find 
some more cases that are currently handled by the libcall that might be 
more efficient to expand inline.

I think I'd like to allow a target to specify an action ('Legal',
'Expand', 'Custom', or 'Libcall', as XCore uses) for each (type, base
pointer alignment, base+offset alignment) tuple.  Targets could still
implement special lowerings that don't make sense to put in
target-independent codegen, but I'd imagine most of the common cases
could be handled in a target-independent way.
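
Very roughly, the per-tuple table could look like the sketch below.
Every name in it (MemAction, MemActionTable, set, get) is hypothetical
and is not an existing LLVM interface, and defaulting unlisted tuples to
Expand is an assumption made only for the example.

```cpp
#include <cassert>
#include <map>
#include <tuple>

// Hypothetical set of actions a target could request for a memory access.
enum class MemAction { Legal, Expand, Custom, LibCall };

// Hypothetical table keyed on (type width in bits, base pointer alignment,
// base+offset alignment), with alignments given in bytes.
class MemActionTable {
  std::map<std::tuple<unsigned, unsigned, unsigned>, MemAction> Table;

  static bool isPow2(unsigned V) { return V != 0 && (V & (V - 1)) == 0; }

public:
  void set(unsigned Bits, unsigned BaseAlign, unsigned AccessAlign,
           MemAction Action) {
    assert(isPow2(BaseAlign) && isPow2(AccessAlign));
    Table[std::make_tuple(Bits, BaseAlign, AccessAlign)] = Action;
  }

  // Tuples the target did not mention fall back to Expand (an assumption).
  MemAction get(unsigned Bits, unsigned BaseAlign, unsigned AccessAlign) const {
    auto It = Table.find(std::make_tuple(Bits, BaseAlign, AccessAlign));
    return It == Table.end() ? MemAction::Expand : It->second;
  }
};
```

A target would then say something like set(32, 4, 4, MemAction::Legal)
and set(32, 1, 1, MemAction::LibCall), and leave everything else to the
default expansion.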

The main thing that I think could make this feature very hard to do well
is properly enforcing dependencies between loads and stores that happen
to map onto the same word, even when they can be shown to access
different source-level variables.  I wonder if you'd end up having to
insert a bunch of extra edges between all loads/stores that you can't
prove access different words.
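
The conservative test I have in mind is sketched below.  Once a subword
store is expanded into a read-modify-write of its containing word, two
accesses to different bytes of one word conflict, so the check has to
work on word ranges rather than byte ranges.  The Access struct and
mayShareWord are hypothetical names, and a 4-byte word is assumed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a memory access; `Known` is false when the
// address cannot be determined at compile time.
struct Access {
  bool Known;
  uint64_t Addr;   // byte address, meaningful only if Known
  unsigned Size;   // size in bytes
};

// Returns true unless the two accesses can be proven to touch disjoint
// words.  A true result means an ordering edge would have to be kept.
inline bool mayShareWord(const Access &A, const Access &B,
                         unsigned WordBytes = 4) {
  assert(A.Size > 0 && B.Size > 0 && WordBytes > 0);
  if (!A.Known || !B.Known)
    return true;  // cannot prove anything, so stay conservative
  uint64_t AFirst = A.Addr / WordBytes;
  uint64_t ALast = (A.Addr + A.Size - 1) / WordBytes;
  uint64_t BFirst = B.Addr / WordBytes;
  uint64_t BLast = (B.Addr + B.Size - 1) / WordBytes;
  return AFirst <= BLast && BFirst <= ALast;  // word ranges overlap
}
```
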
> Regards,
>
> Richard
>

-Matt