[PATCH] Optimization for certain shufflevector by using insertps.

Thu Apr 24 23:52:08 PDT 2014

Hi Elena and Andrea,

1 - I've changed the check + change to NormalizeVectorShuffe(), I will update the patch soon;

2 - This won't be as easy as it looks (see below);

3 - I don't understand what the problem would be in making the load smaller. Especially since we do these kinds of load reductions in other places (including target independent opts like in DAGCombiner::ReduceLoadOpStoreWidth, and others). The alignment requirements don't get stricter, so we shouldn't output code that might break when the original wouldn't. And if I'm not mistaken, the alignment exceptions would fall under undefined behaviour, which wouldn't make this optimization invalid;
  By the way, with the current transform, we would never generate an add instruction for the insertps. We just generate the insertps with the address calculation in it (just to be sure we're on the same page).

4 - I will change the test. After this fix I'll also change other tests in the same file that similarly check for the same text on X32 and X64.

About supporting i32 besides f32:
  I can simply do the transform for MVT::v4i32 too, but it will not optimize as much as the current one does for f32.
  It will optimize some cases but, for example, we might still get movsd+insertps when insertps should suffice.

To fix this problem, we add a bunch of new ones:
  I don't know that much about the backend, but there doesn't seem to be a way to do it without creating a new SDNode type for INSERTPS (I don't know what it would be called, since INSERTPS is already used). (X86InstrFragmentsSIMD.td:84)
  I can't find a way to have, on the type profile, an OR condition, which would allow us to define X86insertps as getting a f32 or i32 from memory.
  Without this, the pattern we generate in X86InstrSSE.td:6522 won't match due to its types (even if we define a whole new multiclass for the i32 case).

I think the best course of action would be to:
  - Add some more tests, for the i32 cases (and collapse the tests to a common check-prefix);
  - Do the transform on both i32 and f32, for now, even if the generated instructions aren't perfect;
  - Later figure out how to better express the insertps for i32 in the backed (either with a new instruction, or a simpler one-time hack for this one).

What do you think?

  Filipe

http://reviews.llvm.org/D3475