[llvm-commits] [llvm] r142152 - in /llvm/trunk: lib/CodeGen/SelectionDAG/ test/CodeGen/ARM/ test/CodeGen/CellSPU/ test/CodeGen/X86/

Tue Oct 18 00:37:05 PDT 2011

Owen, 

I discussed the legalization of <2 x i16> stores on ARM with Anton. As you mentioned, i16 is illegal on ARM and it is not possible to scalarize the store in the Legalizer. 
This was the main reason for moving the legalization of vector memory ops into LegalizeVectorOps. 

I agree that in some cases promoting the elements in the vector is less efficient than widening the number of elements. In general, however, ‘promotion’ is the better strategy. I am mostly interested in code generation for auto-vectorized IR. Which workloads are you mostly interested in? Maybe we can discuss the optimizations needed for these workloads. I benchmarked x86 programs (with SSE and AVX), but ARM and other SIMD processors should be similar.

Thanks,
Nadav

From: Owen Anderson [mailto:resistor at mac.com] 
Sent: Monday, October 17, 2011 23:53
To: Rotem, Nadav
Cc: llvm-commits at cs.uiuc.edu
Subject: Re: [llvm-commits] [llvm] r142152 - in /llvm/trunk: lib/CodeGen/SelectionDAG/ test/CodeGen/ARM/ test/CodeGen/CellSPU/ test/CodeGen/X86/

Nadav,

On Oct 17, 2011, at 1:37 PM, Rotem, Nadav wrote: 
The new type-legalization generates new code sequences, for example trunc stores and anyext loads. I implemented fast load/store sequences for x86. Other targets also need to optimize the new sequences. I opened a bug report on this: PR11158.

Unfortunately, that PR does not cover the issue exposed in the testcase I pointed out, which is a real, significant performance issue with this approach.  Take a look at this snippet:

define void @test_vrev64(<4 x i16>* nocapture %source, <2 x i16>* nocapture %dst) nounwind ssp {
entry:
  %0 = bitcast <4 x i16>* %source to <8 x i16>*
  %tmp2 = load <8 x i16>* %0, align 4
  %tmp3 = extractelement <8 x i16> %tmp2, i32 6
  %tmp5 = insertelement <2 x i16> undef, i16 %tmp3, i32 0
  %tmp9 = extractelement <8 x i16> %tmp2, i32 5
  %tmp11 = insertelement <2 x i16> %tmp5, i16 %tmp9, i32 1
  store <2 x i16> %tmp11, <2 x i16>* %dst, align 4
  ret void
}

In NEON, vectors of i16 are legal, but i16 is not.  The correct code generation sequence was to collapse all of the insert/extract vectors into a shuffle, at which point there are no longer illegal types present.  With your change, we promote all the vectors to vectors of i32, at which point we can no longer match the desired shuffle instruction, in addition to having to emit a (possibly inefficient) vector trunc_store.  Even if we do add an efficient trunc_store lowering to the ARM backend, it will still be unable to match the efficient shuffle because we have obfuscated the code by promoting rather than collapsing it to a shuffle.
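
To make the desired form concrete, here is a rough hand-written sketch (not actual compiler output) of what the two extractelement/insertelement pairs in the snippet above amount to once collapsed. The value name %shuf is made up for illustration; the mask simply picks lane 6 and then lane 5 of %tmp2, matching the order of the inserts:

  ; hypothetical collapsed form of %tmp3/%tmp5/%tmp9/%tmp11
  %shuf = shufflevector <8 x i16> %tmp2, <8 x i16> undef, <2 x i32> <i32 6, i32 5>

Written this way, no scalar i16 value appears between the load and the store, which is the property the promotion to vectors of i32 loses.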

That collapse to a shuffle is what the test that was removed was checking for.

--Owen