[LLVMdev] Unaligned vector memory access for ARM/NEON.

Bob Wilson bob.wilson at apple.com
Thu Sep 6 15:38:38 PDT 2012


On Sep 6, 2012, at 2:48 PM, David Peixotto <dpeixott at codeaurora.org> wrote:

> Hi Pete,
> 
> We ran into the same issue with generating vector loads/stores for vectors
> with less than word alignment. It seems we took a similar approach to
> solving the problem by modifying the logic in allowsUnalignedMemoryAccesses.
> 
> As you and Jim mentioned, it looks like the vld1/vst1 instructions should
> support element-aligned access on any armv7 implementation (I'm looking at
> Table A3-1 in the ARM Architecture Reference Manual, ARM DDI 0406C).
> 
> Right now I do not think we have the correct code setup in ARMSubtarget to
> accurately represent this table. I would propose that we keep the existing
> field for unaligned access and add a new field for element-aligned access.
> 
> The AllowsUnalignedMem field remains as-is; it could be used to represent
> the SCTLR.A column in Table A3-1. The AllowsElementAlignedNEON field would
> be used to allow targets to generate vld1/vst1 instructions for
> element-aligned accesses. By default it would be set to true for armv7
> targets with NEON.
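> 
> Roughly, I'm thinking of something like the following (untested sketch;
> the new field/accessor names and the exact hook signature are just my
> proposal, not existing code):
> 
>   // ARMSubtarget.h
>   bool AllowsUnalignedMem;        // models SCTLR.A == 0 (all accesses)
>   bool AllowsElementAlignedNEON;  // element-aligned vld1/vst1 are OK
> 
>   // ARMISelLowering.cpp
>   bool ARMTargetLowering::allowsUnalignedMemoryAccesses(EVT VT) const {
>     // Unaligned accesses of any type are fine when SCTLR.A is clear.
>     if (Subtarget->allowsUnalignedMem())
>       return true;
>     // Otherwise, only claim support for vector types that we can lower
>     // through element-aligned vld1/vst1.
>     if (VT.isVector() && Subtarget->hasNEON() &&
>         Subtarget->allowsElementAlignedNEON())
>       return true;
>     return false;
>   }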

That doesn't make sense to me.  Element-aligned Neon load/stores are always valid.  The AllowsUnalignedMem setting is supposed to model the SCTLR.A setting, which also applies to Neon load/stores with less-than-element alignment.  Why would we need a new field?

> 
> The -arm-strict-align flag would set both fields to false. This would
> retain the behavior that the test/CodeGen/ARM/unaligned_load_store.ll test
> case seems to expect.
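> 
> For example (illustrative command line, not an existing test):
> 
>   llc -mtriple=armv7-none-linux-gnueabi -mattr=+neon -arm-strict-align foo.ll
> 
> would then expand both unaligned scalar accesses and sub-word-aligned
> vector accesses instead of emitting vld1/vst1.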
> 
> A bit of a grey area is an unaligned f64 store when
> AllowsElementAlignedNEON is true. We can actually generate a vst1.8 to
> handle this store directly instead of falling back to the
> target-independent lowering, and I think it would be good to do so.
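> 
> As a concrete (made-up) example, an IR store like
> 
>   store double %val, double* %p, align 1
> 
> could be selected directly to a byte-aligned vst1.8 {d16}, [r0] rather
> than being split up by the target-independent expansion.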

Don't we already do this as of svn r161962?

> 
> I have some code to do this that I will likely be able to upstream.
> 
> -Dave
> 
> -- Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted
> by The Linux Foundation
> 
> 
> -----Original Message-----
> From: llvmdev-bounces at cs.uiuc.edu [mailto:llvmdev-bounces at cs.uiuc.edu] On
> Behalf Of Peter Couperus
> Sent: Thursday, September 06, 2012 8:14 AM
> To: Jim Grosbach
> Cc: Jakob Olesen; llvmdev at cs.uiuc.edu (LLVMdev at cs.uiuc.edu)
> Subject: Re: [LLVMdev] Unaligned vector memory access for ARM/NEON.
> 
> Hello,
> 
> Thanks again.  We did try overestimating the alignment, and saw the vldr you
> reference here.
> It looks like a recent change (r161962?) did enable vld1 generation for this
> case (great!) on darwin, but not linux.
> I'm not sure whether lowering load <4 x i16>* align 2 to vld1.16 was an
> intended effect of that change or not.
> If so, my question is: what is the preferred way to inform the Subtarget
> that it is allowed to use unaligned vector loads/stores when NEON is
> available, but can't use unaligned accesses in general?
> A new field in ARMSubtarget?
> Should the -arm-strict-align flag force expansion even on unaligned vector
> loads/stores?
> We got this working by adding a field to ARMSubtarget and changing logic in
> ARMTargetLowering::allowsUnalignedMemoryAccesses, but I am admittedly not
> entirely sure of the downstream consequences of this, as we don't allow
> unaligned access generally.
> 
> Pete
> 
> 
> On 09/05/2012 04:58 PM, Jim Grosbach wrote:
>> Hmmm. Well, it's entirely possible that it's LLVM that's confused 
>> about the alignment requirements here. :)
>> 
>> I think I see, in general, where. I twiddled the IR to give it higher
>> alignment (16 bytes) and get:
>> extend:                                 @ @extend
>> @ BB#0:
>> 	vldr	d16, [r0]
>> 	vmovl.s16	q8, d16
>> 	vstmia	r1, {d16, d17}
>> 	vldr	d16, [r0, #8]
>> 	add	r0, r1, #16
>> 	vmovl.s16	q8, d16
>> 	vstmia	r0, {d16, d17}
>> 	bx	lr
>> 
>> Note that we're using a plain vldr instruction here to load the d
>> register, not a vld1 instruction. Similarly for the stores. According to
>> the ARM ARM (DDI 0406C), you're correct about the element-size alignment
>> requirement for VLD1, but our isel isn't attempting to use that
>> instruction; it uses VLDR, which requires word alignment, so it falls over.
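>> 
>> For reference, the element-aligned form we'd want here is something like
>> (illustrative only):
>> 
>> 	vld1.16	{d16}, [r0]
>> 
>> which only requires 16-bit alignment, whereas vldr d16, [r0] requires a
>> word-aligned address.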
>> 
>> Given that, it seems that the answer to your original question is that to
>> improve codegen for this case, the proper place to look is in instruction
>> selection for loads and stores to the VFP/NEON registers. That code can be
>> made smarter to better use the NEON instructions. I know Jakob has done
>> some work related to better utilization of those for other constructs.
>> 
>> -Jim
>> 
>> On Sep 5, 2012, at 4:25 PM, Peter Couperus <peter.couperus at st.com> wrote:
>> 
>>> Hello Jim,
>>> 
>>> Thank you for the response.  I may be confused about the alignment rules
>>> here.
>>> I had been looking at the ARM RVCT Assembler Guide, which seems to
>>> indicate vld1.16 operates on 16-bit aligned data, unless I am
>>> misinterpreting their table (Table 5-11 in ARM DUI 0204H, pp. 5-70 and 5-71).
>>> Prior to the table, it does mention the accesses need to be "element"
>>> aligned, where I took element in this case to mean i16.
>>> 
>>> Anyhow, to make this a little more concrete:
>>> 
>>> void extend(short* a, int* b) {
>>>   for (int i = 0; i < 8; i++)
>>>     b[i] = (int)a[i];
>>> }
>>> 
>>> When I compile this program with clang -O3 -ccc-host-triple
>>> armv7-none-linux-gnueabi -mfpu=neon -mllvm -vectorize, the intermediate
>>> LLVM assembly looks OK (and it has an align 2 vector load), but the
>>> generated ARM assembly has the scalar loads.
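>>> 
>>> The vector load in question looks roughly like this (hand-written
>>> sketch, not clang's exact output):
>>> 
>>>   %va  = load <4 x i16>* %a, align 2
>>>   %ext = sext <4 x i16> %va to <4 x i32>
>>>   store <4 x i32> %ext, <4 x i32>* %b, align 4
>>> 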
>>> When I compile with (4.6) gcc -std=c99 -ftree-vectorize -marm -mfpu=neon
>>> -O3, it uses vld1.16 and vst1.32 regardless of the parameter alignment.
>>> This is on armv7a.
>>> 
>>> The gcc version (and the clang version with our modified backend) runs
>>> fine, even on 2-byte aligned data.  Is this not a guarantee across
>>> armv7/armv7a generally?
>>> 
>>> Pete
>>> 
>>> 
>>> 
>>> 
>>> On 09/05/2012 03:15 PM, Jim Grosbach wrote:
>>>> VLD1 expects a 64-bit aligned address unless the target explicitly says
>>>> that unaligned loads are OK.
>>>> 
>>>> For your situation, either the subtarget should set AllowsUnalignedMem
>>>> to true (if that's accurate), or the load address should be made 64-bit
>>>> aligned.
>>>> 
>>>> -Jim
>>>> 
>>>> On Sep 5, 2012, at 2:42 PM, Peter Couperus <peter.couperus at st.com>
>>>> wrote:
>>>> 
>>>>> Hello all,
>>>>> 
>>>>> I am a first-time writer here, but am a happy LLVM tinkerer.  It is a
>>>>> pleasure to use :).
>>>>> We have come across some sub-optimal behavior when LLVM lowers loads
>>>>> for vectors of small integers, e.g. load <4 x i16>* %a, align 2, using
>>>>> a sequence of scalar loads rather than a single vld1 on armv7 linux
>>>>> with NEON.
>>>>> Looking at the code in svn, it appears the ARM backend is capable of
>>>>> lowering these loads as desired, and will if we use an appropriate
>>>>> darwin triple.
>>>>> It appears this was actually enabled relatively recently.
>>>>> Seemingly, the case where the Subtarget has NEON available should be
>>>>> handled the same on Darwin and Linux.
>>>>> Is this true, or am I missing something?
>>>>> Do the regulars have an opinion on the best way to handle this?
>>>>> Thanks!
>>>>> 
>>>>> Pete
>>>>> 



