[PATCH] D26103: Add tips for generic IR vs architecture specific code.

Alina Sbirlea via llvm-commits llvm-commits at lists.llvm.org
Fri Oct 28 15:54:12 PDT 2016


asbirlea created this revision.
asbirlea added a reviewer: reames.
asbirlea added subscribers: llvm-commits, mkuper.
Herald added a subscriber: aemerson.

This patch is a start at encouraging frontends to emit generic IR while still getting well-tuned code on specific architectures.
It starts off with two examples: strided loads and stores for ARM/AArch64.
I expect the doc to be expanded with more patterns, some of which are still under discussion.

The doc could also be scoped to something more specific, such as "codegen of vector code".
If the content grows too large, it could be moved to a subpage.

Please suggest who is best to review this.
I'm adding Philip as the doc owner, and Michael as FYI for a future AVX512 doc.


https://reviews.llvm.org/D26103

Files:
  docs/Frontend/PerformanceTips.rst


Index: docs/Frontend/PerformanceTips.rst
===================================================================
--- docs/Frontend/PerformanceTips.rst
+++ docs/Frontend/PerformanceTips.rst
@@ -120,6 +120,93 @@
 lower an under aligned access into a sequence of natively aligned accesses.  
 As a result, alignment is mandatory for atomic loads and stores.
 
+Architecture-specific code
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+Whenever possible, the generated IR should be generic IR rather than
+architecture-specific IR (i.e. intrinsics).
+If LLVM cannot lower the generic code to the desired intrinsic, start a
+discussion on `llvm-dev <http://lists.llvm.org/mailman/listinfo/llvm-dev>`_
+about the missing lowering opportunity.
+A few known patterns that LLVM lowers to target intrinsics are listed below.
+
+The *interleaved access pass* performs the following lowerings (tests can be
+found in CodeGen/ARM/arm-interleaved-accesses.ll and
+CodeGen/AArch64/aarch64-interleaved-accesses.ll):
+
+#. ARM/AArch64: lower an interleaved/strided load into a vldN/ldN intrinsic
+   (a self-contained example follows this list).
+        * General rule: Factor = F, Lane Length = L: 
+                ::
+
+                        %wide.vec = load %ptr 
+                        %v1 = shufflevector %wide.vec, undef, <m1,     m1+F,     ..., m1+(L-1)*F> 
+                        [...] 
+                        %vF = shufflevector %wide.vec, undef, <m1+F-1, m1+2*F-1, ..., m1+L*F-1> 
+
+          Is lowered to:
+                ::
+
+                        %ldF = call @llvm.arm.neon.vldF(%ptr, alignment)
+                                  ; @llvm.aarch64.neon.ldF(%ptr)
+                        %vec1 = extractvalue %ldF, 0
+                        [...]
+                        %vecF = extractvalue %ldF, F-1
+
+        * E.g. Factor = 2, Lane Length = 4:
+                .. code-block:: llvm
+
+                        %wide.vec = load <8 x i32>, <8 x i32>* %ptr
+                        %v0 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>  ; Extract even elements
+                        %v1 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>  ; Extract odd elements
+
+          Is lowered to:
+                ::
+
+                        %ld2 = call { <4 x i32>, <4 x i32> } @llvm.arm.neon.vld2(<8 x i32>* %ptr, i32 4)
+                                                           ; @llvm.aarch64.neon.ld2(<8 x i32>* %ptr)
+                        %vec0 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 0
+                        %vec1 = extractvalue { <4 x i32>, <4 x i32> } %ld2, 1
+
+#. ARM/AArch64: lower an interleaved/strided store into a vstN/stN intrinsic
+   (a self-contained example follows this list as well).
+        * General rule: Factor = F, Lane Length = L:
+                ::
+
+                        %i.vec = shufflevector %v0, %v1,
+                                        <m1,     m2,     ..., mF,
+                                         m1+1,   m2+1,   ..., mF+1,
+                                         ...,
+                                         m1+L-1, m2+L-1, ..., mF+L-1>
+                        store %i.vec, %ptr
+
+          Is lowered to:
+                ::
+
+                        %sub.v1 = shufflevector %v0, %v1, <m1, ..., m1+L-1>
+                        [...]
+                        %sub.vF = shufflevector %v0, %v1, <mF, ..., mF+L-1>
+                        call void @llvm.arm.neon.vstF(%ptr, %sub.v1, ..., %sub.vF, alignment)
+                                ; @llvm.aarch64.neon.stF(%sub.v1, ..., %sub.vF, %ptr)
+
+        * E.g. Factor = 3, Lane Length = 4:
+                .. code-block:: llvm
+
+                        %i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1,
+                                        <i32 0, i32 4, i32 8, 
+                                         i32 1, i32 5, i32 9, 
+                                         i32 2, i32 6, i32 10, 
+                                         i32 3, i32 7, i32 11>
+                        store <12 x i32> %i.vec, <12 x i32>* %ptr
+
+          Is lowered to:
+                .. code-block:: llvm
+
+                        %sub.v0 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 3>
+                        %sub.v1 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 4, i32 5, i32 6, i32 7>
+                        %sub.v2 = shufflevector <8 x i32> %v0, <8 x i32> %v1, <4 x i32> <i32 8, i32 9, i32 10, i32 11>
+                        call void @llvm.arm.neon.vst3(<12 x i32>* %ptr, <4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, i32 4)
+                                ; @llvm.aarch64.neon.st3(<4 x i32> %sub.v0, <4 x i32> %sub.v1, <4 x i32> %sub.v2, <12 x i32>* %ptr)
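+
+For reference, here is a self-contained function exhibiting the generic
+load pattern above, for Factor = 2, Lane Length = 4 (function and value
+names are illustrative). When compiled for ARM or AArch64, the interleaved
+access pass is expected to rewrite the load/shufflevector sequence into a
+single vld2/ld2:
+
+.. code-block:: llvm
+
+        define { <4 x i32>, <4 x i32> } @deinterleave2(<8 x i32>* %ptr) {
+        entry:
+          %wide.vec = load <8 x i32>, <8 x i32>* %ptr, align 4
+          ; Even elements: lanes 0, 2, 4, 6 of %wide.vec.
+          %v0 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
+          ; Odd elements: lanes 1, 3, 5, 7 of %wide.vec.
+          %v1 = shufflevector <8 x i32> %wide.vec, <8 x i32> undef, <4 x i32> <i32 1, i32 3, i32 5, i32 7>
+          ; Package the two de-interleaved vectors for the caller.
+          %res0 = insertvalue { <4 x i32>, <4 x i32> } undef, <4 x i32> %v0, 0
+          %res1 = insertvalue { <4 x i32>, <4 x i32> } %res0, <4 x i32> %v1, 1
+          ret { <4 x i32>, <4 x i32> } %res1
+        }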
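+
+Similarly, a self-contained sketch of the generic store pattern, for
+Factor = 3, Lane Length = 4 (again, names are illustrative); the
+shufflevector/store pair is expected to become a single vst3/st3:
+
+.. code-block:: llvm
+
+        define void @interleave3(<8 x i32> %v0, <8 x i32> %v1, <12 x i32>* %ptr) {
+        entry:
+          ; Interleave three lanes of length 4 drawn from the
+          ; concatenation of %v0 and %v1.
+          %i.vec = shufflevector <8 x i32> %v0, <8 x i32> %v1,
+                                 <12 x i32> <i32 0, i32 4, i32 8,
+                                             i32 1, i32 5, i32 9,
+                                             i32 2, i32 6, i32 10,
+                                             i32 3, i32 7, i32 11>
+          store <12 x i32> %i.vec, <12 x i32>* %ptr, align 4
+          ret void
+        }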
+
+LLVM makes no promise to lower every generic IR pattern optimally, so the
+patterns above, while generic IR, are the recommended forms for the
+particular platforms listed.
+For more suggestions of architecture-specific patterns, please send
+a patch to `llvm-commits
+<http://lists.llvm.org/mailman/listinfo/llvm-commits>`_ for review.
+
 Other Things to Consider
 ^^^^^^^^^^^^^^^^^^^^^^^^
 

