[PATCH] D15741: [X86] Avoid folding scalar loads into unary sse intrinsics

Wed Dec 30 08:39:47 PST 2015

spatel accepted this revision.
spatel added a comment.
This revision is now accepted and ready to land.

In http://reviews.llvm.org/D15741#317959, @mkuper wrote:

>   def CVTSD2SSrm  : I<0x5A, MRMSrcMem, (outs FR32:$dst), (ins f64mem:$src),
>                       "cvtsd2ss\t{$src, $dst|$dst, $src}",
>                       [(set FR32:$dst, (fround (loadf64 addr:$src)))],
>                       IIC_SSE_CVT_Scalar_RM>,
>                       XD,
>                   Requires<[UseSSE2, OptForSize]>, Sched<[WriteCvtF2FLd]>;

Ah, I managed to miss that one.
How about adding some 'FIXME' notes and/or changing the other defs since we're currently inconsistent about this? LGTM otherwise.

  float f1(int *x) { return *x; } 
  double f2(int *x) { return *x; }
  float f3(long long *x) { return *x; }
  double f4(long long *x) { return *x; }
  float f5(double *x) { return *x; }
  double f6(float *x) { return *x; }

  $ ./clang -O1 ss2si.c -S -o - |grep cvt
  cvtsi2ssl	(%rdi), %xmm0
  cvtsi2sdl	(%rdi), %xmm0
  cvtsi2ssq	(%rdi), %xmm0
  cvtsi2sdq	(%rdi), %xmm0
  cvtsd2ss	%xmm0, %xmm0
  cvtss2sd	%xmm0, %xmm0

Regarding handling this via ExeDepsFix - it's not clear to me that its current solution:

  xorps %xmm0, %xmm0
  cvtsi2ssl (%rdi) %xmm0

would be better than unfolding the load. I think the xorps instruction saves a byte in all cases, but it may be micro-arch-dependent whether that's actually cheaper?

================
Comment at: lib/Target/X86/X86InstrSSE.td:3392
@@ +3391,3 @@
+  // We don't want to fold scalar loads into these instructions unless optimizing
+  // for size. This is because the folded instruction will have a partial register
+  // update, while the unfolded sequence will not, e.g.
----------------
mkuper wrote:
> spatel wrote:
> > 80-cols.
> The TDs don't enforce 80-cols consistently, and I never remember whether they should. Thanks. :-)
It seems like we mostly try to follow the law, but I would fully support a new rule for these files. 

80-cols causes a lot of extra suffering trying to make sense of this code that's already hard to read. :) 

http://reviews.llvm.org/D15741