[PATCH] D49720: [ARM] Fix over-alignment in arguments that are HA of 128-bit vectors
Petr Pavlu via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Tue Jul 24 01:39:12 PDT 2018
petpav01 created this revision.
petpav01 added reviewers: t.p.northover, olista01, eli.friedman.
Herald added a reviewer: javed.absar.
Herald added subscribers: llvm-commits, chrib, kristof.beyls.
Code in `CC_ARM_AAPCS_Custom_Aggregate()` is responsible for handling homogeneous aggregates for `CC_ARM_AAPCS_VFP`. When an aggregate ends up fully on the stack, the function tries to pack all resulting items of the aggregate as tightly as possible. Once the first item has been laid out, the alignment used for consecutive items is the size of one item.
This logic goes wrong for 128-bit vectors because their natural alignment is only 64 bits; raising the alignment to the 128-bit item size can therefore insert unexpected padding between the first and second element.
Example:
$ cat test.c
#include <arm_neon.h>
typedef struct {
double A[4];
} S_d64_4;
typedef struct {
uint32x4_t A[2];
} S_v128_2;
int foo(S_d64_4 P0, S_d64_4 P1, float P2, S_v128_2 P3) {
// * P0 is passed in D0-D3.
// * P1 is passed in D4-D7.
// * P2 is passed in [SP, SP+4).
// * P3.A[0] is passed in [SP+8, SP+24).
// * P3.A[1] should be passed according to AAPCS in [SP+24, SP+40) but the
// code produced by Clang/LLVM expects it in [SP+32, SP+48).
return vgetq_lane_u32(P3.A[0], 0) + vgetq_lane_u32(P3.A[1], 0);
}
$ clang -target arm-none-eabi -mcpu=cortex-a53 -S test.c -o -
[...]
foo:
push {r11, lr}
mov r11, sp
sub sp, sp, #8
bfc sp, #0, #4
ldr r0, [r11, #40] /* load from entry-SP + #32 */
ldr r1, [r11, #16] /* load from entry-SP + #8 */
add r0, r1, r0
mov sp, r11
pop {r11, pc}
The proposed patch fixes the problem by updating the alignment to the item size only when doing so reduces it.
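The effect of the one-line change can be sketched in a simplified model of the stack packing loop. This is illustrative only, not LLVM's actual code: `allocateStack` and `layoutSecondMember` are hypothetical helpers that mimic the shape of `CC_ARM_AAPCS_Custom_Aggregate()`, with a float already occupying [SP, SP+4) so the aggregate starts at offset 8, as in the example above.

```cpp
#include <algorithm>
#include <cstdint>

// Round StackOffset up to Align, reserve Size bytes, return the slot offset.
// Simplified stand-in for CCState::AllocateStack().
static uint32_t allocateStack(uint32_t &StackOffset, uint32_t Size,
                              uint32_t Align) {
  StackOffset = (StackOffset + Align - 1) & ~(Align - 1);
  uint32_t Offset = StackOffset;
  StackOffset += Size;
  return Offset;
}

// Lay out two aggregate members of the given Size/Align starting at Start and
// return the offset of the second member. With ReduceOnly the patched rule
// Align = std::min(Align, Size) is applied after the first member; otherwise
// the old rule Align = Size.
static uint32_t layoutSecondMember(uint32_t Start, uint32_t Size,
                                   uint32_t Align, bool ReduceOnly) {
  uint32_t StackOffset = Start;
  uint32_t Second = 0;
  for (int I = 0; I < 2; ++I) {
    uint32_t Off = allocateStack(StackOffset, Size, Align);
    if (I == 1)
      Second = Off;
    // After the first item, pack the rest as tightly as possible.
    Align = ReduceOnly ? std::min(Align, Size) : Size;
  }
  return Second;
}
```

For a `[2 x <4 x i32>]` member (Size = 16, natural Align = 8) starting at offset 8, the old rule bumps the alignment to 16 and places the second element at offset 32; the patched rule keeps the alignment at 8 and places it at offset 24, matching the AAPCS layout in the comments of the example.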
Repository:
rL LLVM
https://reviews.llvm.org/D49720
Files:
lib/Target/ARM/ARMCallingConv.h
test/CodeGen/ARM/aggregate-padding.ll
Index: test/CodeGen/ARM/aggregate-padding.ll
===================================================================
--- test/CodeGen/ARM/aggregate-padding.ll
+++ test/CodeGen/ARM/aggregate-padding.ll
@@ -99,3 +99,19 @@
%sum = add i16 %val0, %val2
ret i16 %sum
}
+
+; [2 x <4 x i32>] should be aligned only on a 64-bit boundary and contiguous.
+; None of the two <4 x i32> elements should introduce any padding to 128 bits.
+define i32 @test_4xi32_64bit_aligned_and_contiguous([8 x double], float, [2 x <4 x i32>] %arg) nounwind {
+; CHECK-LABEL: test_4xi32_64bit_aligned_and_contiguous:
+; CHECK-DAG: ldr [[VAL0_0:r[0-9]+]], [sp, #8]
+; CHECK-DAG: ldr [[VAL1_0:r[0-9]+]], [sp, #24]
+; CHECK: add r0, [[VAL0_0]], [[VAL1_0]]
+
+ %val0 = extractvalue [2 x <4 x i32>] %arg, 0
+ %val0_0 = extractelement <4 x i32> %val0, i32 0
+ %val1 = extractvalue [2 x <4 x i32>] %arg, 1
+ %val1_0 = extractelement <4 x i32> %val1, i32 0
+ %sum = add i32 %val0_0, %val1_0
+ ret i32 %sum
+}
Index: lib/Target/ARM/ARMCallingConv.h
===================================================================
--- lib/Target/ARM/ARMCallingConv.h
+++ lib/Target/ARM/ARMCallingConv.h
@@ -276,7 +276,7 @@
// After the first item has been allocated, the rest are packed as tightly
// as possible. (E.g. an incoming i64 would have starting Align of 8, but
// we'll be allocating a bunch of i32 slots).
- Align = Size;
+ Align = std::min(Align, Size);
}
// All pending members have now been allocated