[LLVMdev] Unaligned vector memory access for ARM/NEON.

Fri Sep 7 10:56:57 PDT 2012

On Sep 6, 2012, at 4:40 PM, David Peixotto <dpeixott at codeaurora.org> wrote:

> -----Original Message-----
> From: Bob Wilson [mailto:bob.wilson at apple.com] 
> Sent: Thursday, September 06, 2012 3:39 PM
> To: David Peixotto
> Cc: 'Peter Couperus'; 'Jim Grosbach'; 'Jakob Olesen'; llvmdev at cs.uiuc.edu
> Subject: Re: [LLVMdev] Unaligned vector memory access for ARM/NEON.
> 
> 
> On Sep 6, 2012, at 2:48 PM, David Peixotto <dpeixott at codeaurora.org> wrote:
> 
>> Hi Pete,
>> 
>> We ran into the same issue with generating vector loads/stores for 
>> vectors with less than word alignment. It seems we took a similar 
>> approach to solving the problem by modifying the logic in
> allowsUnalignedMemoryAccesses.
>> 
>> As you and Jim mentioned, it looks like the vld1/vst1 instructions 
>> should support element aligned access for any armv7 implementation 
>> (I'm looking at Table A3-1 ARM Architecture Reference Manual - ARM DDI
> 0406C).
>> 
>> Right now I do not think we have the correct code setup in 
>> ARMSubtarget to accurately represent this table. I would propose that 
>> we keep the existing field for unaligned access and add a new field for
> element-aligned access.
>> 
>> The AllowsUnAlignedMem field remains as is and it could be used to 
>> represent the SCTLR.A column in Table A3-1. The 
>> AllowsElementAlignedNEON field would be used allow targets to generate 
>> vld1/vst1 instructions for element-aligned accesses. By default it would
> be set to true for armv7 targets with NEON.
> 
> That doesn't make sense to me.  Element-aligned Neon load/stores are always
> valid.  The AllowsUnalignedMem setting is supposed to model the SCTLR.A
> setting, which also applies to Neon load/stores with less-than-element
> alignment.  Why would we need a new field?
> 
> I was proposing it as a way to keep the current behavior of the
> AllowsUnAlignedMem field. Originally I just guarded the use of vld1/vst1 by
> testing whether the target has NEON, but then I was getting a failure in
> test/CodeGen/ARM/unaligned_load_store.ll because it did not want to see a
> vld1 when using the -arm-strict-align even if the target has NEON. 

I think we could just change the test case.  The point of -arm-strict-align is just to avoid relying on hardware or OS support for unaligned accesses.  Using vld1.8/vst1.8 instructions would not require either of those.

> 
> The difference in the flags is between allowing unaligned access for
> half-word and word load/stores (e.g. ldr/str) and getting around an
> unaligned access for D registers by using vld1.8/vst1.8 to load/store.
> 
> It looks to me that a target that supports NEON could support unaligned
> access for D registers by using vld1.8/vst1.8 regardless of whether it
> supports unaligned access for other types. Currently these two are tied
> together into a single flag. The new flag was proposed as a way to make a
> distinction between these.

If it works on all targets with Neon, regardless of the SCTLR.A setting, then I still don't see the need for a separate flag.

Note that this won't work on big-endian targets where changing the vld1/vst1 element size actually changes the behavior of the instructions.

> 
> I guess one way to keep the current behavior would be to look back through
> the store instruction to find the actual vector type (everything gets
> legalized to f64 or v2f64) and allow it for real vector types but disallow
> it for a real f64 type. It seems that it would be good to support the
> unaligned f64 access if we have NEON though.
> 
>> 
>> The -arm-strict-align would set both of the fields to false. This 
>> would retain the behavior that seems to be desired from the 
>> test/CodeGen/ARM/unaligned_load_store.ll test case.
>> 
>> A bit of a grey area is if we have an unaligned f64 store and 
>> AllowsElementAlignedNEON is true. We can actually generate a vstr1.8 
>> to support this store directly instead of using the target-independent 
>> method and I think it would be good to do so.
> 
> Don't we already do this as of svn r161962?
> 
> No, the vstr1 will only be generated if the target supports unaligned
> access. Setting AllowsUnAlignedMem to true will then also allow unaligned
> access using smaller types (e.g. word) which may not be desirable.
> 
> -Dave

If AllowsUnAlignedMem is set, we should take advantage of it; otherwise, for little-endian targets, we can fall back to your suggestion of using vld1.8/vst1.8.

> 
>> 
>> I have some code to do this that I will likely be able to upstream.
>> 
>> -Dave
>> 
>> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, 
>> hosted by The Linux Foundation
>> 
>> 
>> -----Original Message-----
>> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] 
>> On Behalf Of Peter Couperus
>> Sent: Thursday, September 06, 2012 8:14 AM
>> To: Jim Grosbach
>> Cc: Jakob Olesen; llvmdev at cs.uiuc.edu (LLVMdev at cs.uiuc.edu)
>> Subject: Re: [LLVMdev] Unaligned vector memory access for ARM/NEON.
>> 
>> Hello,
>> 
>> Thanks again.  We did try overestimating the alignment, and saw the 
>> vldr you reference here.
>> It looks like a recent change (r161962?) did enable vld1 generation 
>> for this case (great!) on darwin, but not linux.
>> I'm not sure if the effect of lowering load <4 x i16>* align 2 to
>> vld1.16 this was intentional in this change or not.
>> If so, my question is what is the preferable way to inform the 
>> Subtarget that it is allowed to use unaligned vector loads/stores when 
>> NEON is available, but can't use unaligned accesses generally speaking?
>> A new field in ARMSubtarget?
>> Should the -arm-strict-align flag force expansion even on unaligned 
>> vector loads/stores?
>> We got this working by adding a field to ARMSubtarget and changing 
>> logic in ARMTargetLowering::allowsUnalignedMemoryAccesses, but I am 
>> admittedly not entirely sure of the downstream consequences of this, 
>> as we don't allow unaligned access generally.
>> 
>> Pete
>> 
>> 
>> On 09/05/2012 04:58 PM, Jim Grosbach wrote:
>>> Hmmm. Well, it's entirely possible that it's LLVM that's confused 
>>> about the alignment requirements here. :)
>>> 
>>> I think I see, in general, where. I twiddled the IR to give it higher
>> alignment (16 bytes) and get:
>>> extend:                                 @ @extend
>>> @ BB#0:
>>> 	vldr	d16, [r0]
>>> 	vmovl.s16	q8, d16
>>> 	vstmia	r1, {d16, d17}
>>> 	vldr	d16, [r0, #8]
>>> 	add	r0, r1, #16
>>> 	vmovl.s16	q8, d16
>>> 	vstmia	r0, {d16, d17}
>>> 	bx	lr
>>> 
>>> Note that we're using a plain vldr instruction here to load the d
>> register, not a vld1 instruction. Similarly for the stores. According 
>> to the ARM ARM (DDI 0406C),  you're correct about the element size 
>> alignment requirement for VLD1, but our isel isn't attempting to use 
>> that instruction, but rather VLDR, which has word alignment required, so
> it falls over.
>>> 
>>> Given that, it seems that the answer to your original question is 
>>> that to
>> improve codegen for this case, the proper place to look is in 
>> instruction selection for loads and stores to the VFP/NEON registers. 
>> That code can be made smarter to better use the NEON instructions. I 
>> know Jakob has done some work related to better utilization of those for
> other constructs.
>>> 
>>> -Jim
>>> 
>>> On Sep 5, 2012, at 4:25 PM, Peter Couperus<peter.couperus at st.com>  wrote:
>>> 
>>>> Hello Jim,
>>>> 
>>>> Thank you for the response.  I may be confused about the alignment 
>>>> rules
>> here.
>>>> I had been looking at the ARM RVCT Assembler Guide, which seems to 
>>>> indicate vld1.16 operates on 16-bit aligned data, unless I am
>> misinterpreting their table (Table 5-11 in ARM DUI 0204H, pg 5-70,5-71).
>>>> Prior to the table, It does mention the accesses need to be "element"
>> aligned, where I took element in this case to mean i16.
>>>> 
>>>> Anyhow, to make this a little more concrete:
>>>> 
>>>> void extend(short* a, int* b) {
>>>> for(int i = 0; i<  8; i++)
>>>>   b[i] = (int)a[i];
>>>> }
>>>> 
>>>> When I compile this program with clang -O3 -ccc-host-triple 
>>>> armv7-none-linux-gnueabi -mfpu=neon -mllvm -vectorize, the 
>>>> intermediate
>> LLVM assembly looks OK (and it has an align 2 vector load), but the 
>> generated ARM assembly has the scalar loads.
>>>> When I compile with (4.6) gcc -std=c99 -ftree-vectorize -marm 
>>>> -mfpu=neon
>> -O3, it uses vld1.16 and vst1.32 regardless of the parameter alignment.
>> This is on armv7a.
>>>> 
>>>> The gcc version (and the clang version with our modified backend) 
>>>> runs
>> fine, even on 2-byte aligned data.  Is this not a guarantee across 
>> armv7/armv7a generally?
>>>> 
>>>> Pete
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 09/05/2012 03:15 PM, Jim Grosbach wrote:
>>>>> VLD1 expects a 64-bit aligned address unless the target explicitly 
>>>>> days
>> that unaligned loads are OK.
>>>>> 
>>>>> For your situation, either the subtarget should set 
>>>>> AllowsUnalignedMem
>> to true (if that's accurate), or the load address should be made 
>> 64-bit aligned.
>>>>> 
>>>>> -Jim
>>>>> 
>>>>> On Sep 5, 2012, at 2:42 PM, Peter Couperus<peter.couperus at st.com>
>> wrote:
>>>>> 
>>>>>> Hello all,
>>>>>> 
>>>>>> I am a first time writer here, but am a happy LLVM tinkerer.  It 
>>>>>> is a
>> pleasure to use :).
>>>>>> We have come across some sub-optimal behavior when LLVM lowers 
>>>>>> loads for vectors with small integers, i.e. load<4 x i16>* %a, 
>>>>>> align 2,
>> using a sequence of scalar loads rather than a single vld1 on armv7 
>> linux with NEON.
>>>>>> Looking at the code in svn, it appears the ARM backend is capable 
>>>>>> of
>> lowering these loads as desired, and will if we use an appropriate 
>> darwin triple.
>>>>>> It appears this was actually enabled relatively recently.
>>>>>> Seemingly, the case where the Subtarget has NEON available should 
>>>>>> be
>> handled the same on Darwin and Linux.
>>>>>> Is this true, or am I missing something?
>>>>>> Do the regulars have an opinion on the best way to handle this?
>>>>>> Thanks!
>>>>>> 
>>>>>> Pete
>>>>>> 
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>> <extend.c>
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> 
>