[PATCH] D41096: [X86] Initial support for prefer-vector-width function attribute

Hal Finkel via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Dec 13 11:42:22 PST 2017


hfinkel added a comment.

In https://reviews.llvm.org/D41096#954199, @craig.topper wrote:

> In https://reviews.llvm.org/D41096#954134, @hfinkel wrote:
>
> > In https://reviews.llvm.org/D41096#953284, @craig.topper wrote:
> >
> > > The dependency isn't really an LLVM dependency. You're absolutely right that if it were just that, it would be fixable by changing our instruction patterns. The problem here is how the features and dependencies are defined in our instruction manuals. avx512f means "avx512 foundation", but only includes 512-bit instructions. The latter features imply the foundation is present.
> > >
> > > A user may expect that if they use "-march=skylake-avx512", they can use 512-bit intrinsics from x86intrin.h. But at the same time, we should generate well-performing code for other users who specify -march=skylake-avx512 but didn't write any vector code in their source. Based on the data we've seen so far, the best way to do that is to disable 512-bit instructions.
> >
> >
> > Understood. However, we can separate this within the implementation. Specifically, Clang does not need to tag all generated functions with the same target attribute. On the LLVM side, we can separate these, and then use that finer-grained control.
> >
> > > So what I'm looking for is a solution that optimizes for the likely case that the user code doesn't contain 512-bit vector intrinsics and tells the legalizer to use 256-bit registers only. But if the user code does contain explicit 512-bit intrinsics, we should still allow that code to compile and ideally generate what the user expected. I don't want existing 512-bit intrinsic code to suddenly stop compiling with -march=skylake-avx512.
> >
> > I think that we're on the same page. In addition to intrinsics, I'm also worried about OpenMP SIMD loops/functions (or other explicit vectorization). We may want those to also give 512-bit vectors by default (presumably, such specifically-tagged loops are likely to run long enough to amortize the clock-rate effects).
> >
> > In https://reviews.llvm.org/D41096#954037, @craig.topper wrote:
> >
> > > getHostCPUName/getHostCPUFeatures is called by the clang driver long before any code is parsed. How would it know if the code contained any 512-bit instructions?
> >
> >
> > I recommend that we do this in Clang during CodeGen. We should add a special callback that will allow TargetInfo to examine the AST and adjust the target (or target features) on a per-function basis. Any function using AVX-512 intrinsics on 512-bit vectors, explicit vector types, or OpenMP SIMD (unless a small simdlen clause is present) will stay as-is, and everything else will be modified to turn off 512-bit vectors.
>
>
> I definitely don't want to change the target-cpu string added by clang. Maybe a target feature. This problem may continue in follow-on CPUs after skx, and I don't want to start coding a specific list of CPUs into clang. We already have more information about CPU feature mapping in clang than I think we would really like.
>
> > The real question is whether to block inlining on mismatch here. I don't think that we should (and we'll need to enable 512-bit vectors in the caller). The problem is that people write C++ wrappers around vector intrinsics, and we need the compiler to remove the abstraction layer. Generating that code poorly will be a significant problem. This will have the unfortunate "action at a distance" effects we discussed earlier (because having some 512-bit vectors in some function, even after inlining, will suddenly enable it elsewhere in the function), but I don't see any good way to prevent that in undesirable cases without causing significant problems elsewhere.
>
> I agree we don't want to block inlining on a mismatch. Do we have a way to allow targets to control the merging behavior? If we do this as part of the "target-feature" or "target-cpu" attribute we would need custom merging.


We don't currently; we only have areInlineCompatible in TTI, which is called like this:

  return TTI.areInlineCompatible(Caller, Callee) &&
         AttributeFuncs::areInlineCompatible(*Caller, *Callee);

We also have:

  AttributeFuncs::mergeAttributesForInlining(*Caller, *Callee);

Adding a corresponding TTI function and calling it in the two places where AttributeFuncs::mergeAttributesForInlining is called would be straightforward.
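For illustration, here is a minimal self-contained sketch of that shape, using stub types rather than the real LLVM API (the hook name ttiMergeAttributesForInlining and the attribute-merge policies are hypothetical, not existing LLVM code):

```cpp
#include <cassert>
#include <map>
#include <string>

// Stub model of function attributes; the real ones live on llvm::Function.
using AttrMap = std::map<std::string, std::string>;

// Stand-in for the existing target-independent
// AttributeFuncs::mergeAttributesForInlining.
void mergeTargetIndependentAttrs(AttrMap &Caller, const AttrMap &Callee) {
  auto It = Callee.find("less-precise-fpmad");
  if (It != Callee.end())
    Caller["less-precise-fpmad"] = It->second; // illustrative policy only
}

// Hypothetical new TTI hook that the inliner would call at the same two
// sites; a target like X86 could override it to do custom merging.
void ttiMergeAttributesForInlining(AttrMap &Caller, const AttrMap &Callee) {
  auto It = Callee.find("target-features");
  if (It != Callee.end())
    Caller["target-features"] = It->second; // placeholder target policy
}

// The inliner's merge step with the target hook slotted in alongside the
// existing call.
void mergeForInlining(AttrMap &Caller, const AttrMap &Callee) {
  mergeTargetIndependentAttrs(Caller, Callee);
  ttiMergeAttributesForInlining(Caller, Callee);
}
```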

> Alternatively, I was thinking about a separate "required-vector-width" attribute. Either clang codegen, or an early IR pass

I prefer that we do this in Clang's CodeGen. We just don't have enough information at the IR level to differentiate between things the user explicitly requested and things that have been added by some earlier stage automatically (plus, the pass method would need to rely on pass injection, or similar, and that won't work with frontends with custom pipelines anyway).

> could analyze the function and set the attribute based on the largest vector width, OpenMP SIMD, ABI requirement, etc. The inliner could merge this by maxing the caller and callee values. This could be generated independently of the target being X86. This attribute could be consumed by the X86 backend to limit the legalizer if it's present, the value is 256 or less, and the CPU is skylake-avx512 or any CPU with the frequency issue.

That makes sense to me.
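The max-merge described above can be sketched with plain data structures; the attribute name "required-vector-width" and its string encoding are assumptions taken from this proposal, not an existing LLVM attribute:

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>

// Stub model of function attributes (real ones live on llvm::Function).
using AttrMap = std::map<std::string, std::string>;

// Merge the proposed "required-vector-width" attribute when inlining: the
// caller must be able to hold the widest vector either function needs, so
// take the max of the two values.
void mergeRequiredVectorWidth(AttrMap &Caller, const AttrMap &Callee) {
  auto It = Callee.find("required-vector-width");
  if (It == Callee.end())
    return; // callee imposes no requirement; caller is unchanged
  unsigned CalleeW = std::stoul(It->second);
  auto CIt = Caller.find("required-vector-width");
  unsigned CallerW = (CIt == Caller.end()) ? 0 : std::stoul(CIt->second);
  Caller["required-vector-width"] = std::to_string(std::max(CallerW, CalleeW));
}
```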

In the backend, I imagine you'll still essentially end up splitting the features and then setting things up similarly to this patch in getSubtargetImpl. Is that the idea?
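As a rough model of what that backend policy might compute: a CPU with the frequency penalty would default the legalizer to 256 bits, with an explicit requirement overriding the default. The function name and the specific clamping rule below are assumptions drawn from this thread, not the patch itself:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical policy: a CPU that pays a frequency penalty for 512-bit ops
// defaults the legalizer to 256-bit registers, unless the function carries a
// "required-vector-width" demanding more (RequiredWidth == 0 means unset).
unsigned computeLegalizerWidth(unsigned HWMaxWidth, bool HasFreqPenalty,
                               unsigned RequiredWidth) {
  unsigned Preferred = HasFreqPenalty ? std::min(HWMaxWidth, 256u)
                                      : HWMaxWidth;
  // An explicit requirement (512-bit intrinsics, OpenMP SIMD, ABI) wins,
  // but can never exceed what the hardware supports.
  return std::max(Preferred, std::min(RequiredWidth, HWMaxWidth));
}
```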


https://reviews.llvm.org/D41096




