[PATCH] D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos

Tue Nov 7 06:57:25 PST 2017

kristof.beyls added inline comments.

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:71-73
+  // This is used to cache instruction replacement decisions within function
+  // units and across function units.
+  std::map<unsigned, bool> SIMDInstrTable;
----------------
az wrote:
> kristof.beyls wrote:
> > The cache here will also need to take into account the TargetSubtargetInfo as the key into the cache.
> > Otherwise, wouldn't you get unexpected results for e.g. the following .ll file with a mixture of targeting different subtargets (see "target-cpu" function attributes)?
> > 
> > ```
> > declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)
> > 
> > define <2 x float> @test_vfma_lane_f32_a57(<2 x float> %a, <2 x float> %b, <2 x float> %v)
> > "target-cpu"="cortex-a57" {
> > entry:
> >   %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
> >   %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
> >   ret <2 x float> %0
> > }
> > 
> > define <2 x float> @test_vfma_lane_f32_exynos(<2 x float> %a, <2 x float> %b, <2 x float> %v)
> > "target-cpu"="exynos-m1" {
> > entry:
> >   %lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
> >   %0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
> >   ret <2 x float> %0
> > }
> > ```
> > 
> > I have no idea how to get a stable key from TargetSubtargetInfo to use in this cache.
> > 
> > At this point, I'd measure the overhead of this pass e.g. using CTMark to make sure that the caching mechanism is actually needed before investing more time in it.
> > 
> I am adding subtarget as part of key. 
Thanks! Could you also add a regression test similar to the LLVM-IR I wrote in my original comment about this to make sure this functionality is tested?

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:163-170
+bool AArch64SIMDInstrOpt::
+shouldReplaceInst(MachineFunction *MF, const MCInstrDesc *InstDesc,
+                  SmallVectorImpl<const MCInstrDesc*> &InstDescRepl) {
+  // Check if replacement decision is already available in the cached table.
   // if so, return it.
-  if (!VecInstElemTable.empty() &&
-      VecInstElemTable.find(InstDesc->getOpcode()) != VecInstElemTable.end())
-    return VecInstElemTable[InstDesc->getOpcode()];
+  if (!SIMDInstrTable.empty() &&
+      SIMDInstrTable.find(InstDesc->getOpcode()) != SIMDInstrTable.end())
----------------
az wrote:
> kristof.beyls wrote:
> > Is the only reason you need to drop const from shouldReplaceInst and other methods that you're using operator[] here to get to an element that's guaranteed to be in the map?
> > If so, wouldn't it be better to use syntax "*SIMDInstrTable.find(InstDesc->getOpcode())", so that the const can remain on all those methods?
> Yes, I was having trouble setting SIMDInstrTable[...] to true or false (few lines below  the line of code you mention above). It thinks that the map member variable I added to the const class has the constant modifier too.
Thanks, that makes sense to me now.

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:333-353
+      return false;
+    ReplInstrMCID.clear();
+    OriginalMCID = &TII->get(AArch64::ST4Fourv16b);
+    for (unsigned i=0; i<4; i++) {
+      ReplInstrMCID.push_back(&TII->get(AArch64::ZIP1v16i8));
+      ReplInstrMCID.push_back(&TII->get(AArch64::ZIP2v16i8));
+    }
----------------
az wrote:
> kristof.beyls wrote:
> > From a logical point of view, this is a lot of code that repeats information encoded in a different way in the big switch statements in optimizeLdStInterleave.
> > It'll be easy to update one location in the future and forget the other location.
> > Couldn't the logic in the switch statements in optimizeLdStInterleave be reused here somehow, so that the information is only encoded once?
> Good suggestion to simplify that code. There are several ways to improve it and here is one way: store the re-writing rules in a table member variable of the class and use that table in shouldExitEarly and in optimizeLdStInterleave. Note that I have not changed optimizeLdStInterleave yet (only used new table in shouldExitEarly). I would like to make sure that what I did is the kind of changes you had originally in mind and if so, then I will use it in optimizeLdStInterleave.
> 
Yes, this is the kind of change I had in mind.
I didn't try to figure out if everything could be driven from such a table, but what you wrote in shouldExitEarly is roughly what I had in mind (with a follow-on nit-pick about making the code even more data-driven in my other comment on the new code).

================
Comment at: llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp:267-278
+    for (unsigned i=0; i<MaxInstrKind; i++)
+    {
+      OriginalMCID = &TII->get(IRI[i].OrigOpc);
+      if (IRI[i].AccT == TwoVectors) {
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[0]));
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[1]));
+        ReplInstrMCID.push_back(&TII->get(IRI[i].ReplOpc[2]));
----------------
I like the table-driven approach a lot better - the code is a lot cleaner, easier to read and easier to maintain.
However, wouldn't it be even cleaner to not let the code behave special based on TwoVectors/FourVectors?
I guess this would mean spelling out the whole sequence of replacement instructions in the table, instead of using the 'for (unsigned j=0; j<4, j++)' loop here, and it would make the table larger?
I still guess that may be the right tradeoff to make, as it will make it easier to understand what the patch does from just reading the table.
And I guess it will also make it easier to later add support for other replacement patterns?

https://reviews.llvm.org/D38196