[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Fri Nov 10 09:49:46 PST 2017

az marked an inline comment as done.
az added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:71-73
+  // This is used to cache instruction replacement decisions within function
+  // units and across function units.
+  std::map<unsigned, bool> SIMDInstrTable;
----------------
kristof.beyls wrote:
> az wrote:
> > kristof.beyls wrote:
> > > The cache here will also need to take into account the TargetSubtargetInfo as the key into the cache.
> > > Otherwise, wouldn't you get unexpected results for e.g. the following .ll file with a mixture of targeting different subtargets (see "target-cpu" function attributes)?
> > > 
> > > ```
> > > declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)
> > > 
> > > define <2 x float> @test_vfma_lane_f32_a57(<2 x float> %a, <2 x float> %b, <2 x float> %v)
> > > "target-cpu"="cortex-a57" {
> > > entry:
> > >   %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
> > >   %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
> > >   ret <2 x float> %0
> > > }
> > > 
> > > define <2 x float> @test_vfma_lane_f32_exynos(<2 x float> %a, <2 x float> %b, <2 x float> %v)
> > > "target-cpu"="exynos-m1" {
> > > entry:
> > >   %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
> > >   %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
> > >   ret <2 x float> %0
> > > }
> > > ```
> > > 
> > > I have no idea how to get a stable key from TargetSubtargetInfo to use in this cache.
> > > 
> > > At this point, I'd measure the overhead of this pass e.g. using CTMark to make sure that the caching mechanism is actually needed before investing more time in it.
> > > 
> > I am adding subtarget as part of key. 
> Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original comment about this to make sure this functionality is tested?
Added your test as is (with a change to function names).

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:267-278
+    for (unsigned i=0; i<MaxInstrKind; i++)
+    {
+      OriginalMCID = &TII->get(IRI[i].OrigOpc);
+      if (IRI[i].AccT == TwoVectors) {
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[0]));
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[1]));
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[2]));
----------------
kristof.beyls wrote:
> I like the table-driven approach a lot better - the code is a lot cleaner, easier to read and easier to maintain.
> However, wouldn't it be even cleaner to not let the code behave special based on TwoVectors/FourVectors?
> I guess this would mean spelling out the whole sequence of replacement instructions in the table, instead of using the 'for (unsigned j=0; j<4, j++)' loop here, and it would make the table larger?
> I still guess that may be the right tradeoff to make, as it will make it easier to understand what the patch does from just reading the table.
> And I guess it will also make it easier to later add support for other replacement patterns?
Removed the TwoVectors/FourVectors fields and enum.

https://reviews.llvm.org/D38196