[PATCH] D43306: [X86] Add pass to infer required-vector-width attribute based on size of function arguments and use of intrinsics

Chandler Carruth via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Mar 28 11:27:33 PDT 2018


chandlerc added a comment.

In https://reviews.llvm.org/D43306#1049842, @echristo wrote:

> > So for both modes, if there are any larger scalar types in the loop, those types will be multiplied by the VF factor and cause the vectorizer to create types that are larger than the maximum hardware register size. These types will exist all the way to codegen, and it is up to the SelectionDAG type legalization process or target specific DAG combines to split these larger operations. In some cases the larger types are used to represent idioms that contain extends and truncates that we were able to combine to more complex instructions without doing a split. X86 probably has more pre type legalization target specific DAG combines here than most other targets. And we are doing our own splitting for type legalization in some of these combines. This prevents the type legalizer from needing to deal with target specific nodes. Though the type legalizer is capable of calling ReplaceNodeResults and LowerOperation for target specific nodes.
>
> This is interesting, and perhaps not what we should do.


FWIW, I really like the ability of passes to create larger-than-legal types and rely on common type legalization to split and fit them into useful vector registers. This proves very useful. That's only one part of what this paragraph talks about, though, and I don't have a strong opinion about the rest.
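To make the splitting behavior concrete, here is a rough model in plain C++ of what type legalization does with a larger-than-legal vector op (this is only an illustration, not LLVM code): a 512-bit `<16 x i32>` add on a target where only 256-bit registers are "legal" is split into two 256-bit halves.

```cpp
#include <array>
#include <cstdint>

// Illustrative model of legalizer splitting: one 512-bit <16 x i32> add
// is carried out as two 256-bit (<8 x i32>) adds.
std::array<int32_t, 16> add512(const std::array<int32_t, 16>& a,
                               const std::array<int32_t, 16>& b) {
  std::array<int32_t, 16> r{};
  for (int half = 0; half < 2; ++half)      // each half is one 256-bit op
    for (int i = 0; i < 8; ++i)
      r[half * 8 + i] = a[half * 8 + i] + b[half * 8 + i];
  return r;
}
```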

> 
> 
>> In addition to the vectorizer, I believe I've seen evidence of InstCombine creating larger types when it canonicalizes getelementptrs to have pointer width types by inserting sign extends. This shows up in vector GEPs used by gather and scatter intrinsics. So if the GEP used a v16i32 index, InstCombine will turn it into v16i64. X86 tries to remove some of these extends with a DAG combine, but we're not great at finding a uniform base address for gathers, which causes the extend to be hidden. So we end up splitting gathers when their result type is the maximum hardware register width, but their index is twice as large.
> 
> This is perhaps also not what we should do with vector accessed operations.

This does seem much more questionable. IMO, not because of the illegal vector types but because extending the scalar component seems much more troubling in a vector-gep context than a scalar-gep context, regardless of the arity of the vector.

> 
> 
>> Based on this current behavior of IR passes with respect to hardware vector width, and the fact that codegen already has to handle illegal types, the first part of my changes (prefer-vector-width) makes the vectorizer behave the same between prefer-vector-width=256 on an avx512f target and an avx2 target by making getRegisterBitWidth return 256 in both cases. So under prefer-vector-width=256, the vectorizer will calculate the same VF factor it would for an AVX2 target.
> 
> 
> 
>> The patch presented here overcomes this limitation by providing a required-vector-width attribute that can be used to tell codegen that there is nothing in the IR that strictly requires wider vectors to avoid compilation failures or ABI mismatches. With this we can tell the type legalization process and the X86 DAG combines that only 256 bit vectors are legal. With this the type legalization process should be carried out in a very similar way to AVX2, splitting the wider types from the vectorizer in the same way. But we also gain the AVX512VL specific instructions like scatter and masked operations that aren't present in AVX2.
> 
> I don't think this part is what we should do. We shouldn't change the size of a register, but rather change the optimizers to prefer smaller things. Legal versus preferred is a big difference.

I disagree here... I think type legalization is a necessary tool to meaningfully achieve a preferred type. I think we would end up building a mini legalizer elsewhere in the optimizer if we remove the ability of the current type legalizer logic to handle this, and that doesn't seem ideal. I also think there is a better way to split this hair....

> 
> 
>> What I'm looking for is a way to guarantee that with prefer-vector-width=256, starting from scalar code and vectorizing it, we won't end up with 512-bit registers in the output. Even a few of them are enough to trigger the frequency reduction penalty I am trying to avoid. Given the current behavior of IR passes and their expectations of what codegen can handle, I don't see a way to make this guarantee from IR. And it would be easy to break in the future as new passes are added or existing passes are modified. The codegen type legalization process seems the best place to handle this. But we need to be able to communicate when it is safe to do so.

I think this core requirement gets at the solution I would propose to this.

I think at the IR level we should simply replace the preferred width with a required width. We should require the frontend to pick the width that it wants code generated for, and stick to it.

Then we can demand that the frontend, if it wants to explicitly generate 512-bit code through things like intrinsics, *increases* the required width for that function. This allows explicit AVX-512 _mm512_* intrinsics to work correctly.
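A sketch of what I mean, in plain C++ (the helper names and the prefix-based width mapping are made up for illustration; a real frontend would key off the actual intrinsic it emits):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: the vector width implied by an intrinsic's
// family prefix (e.g. _mm512_* implies 512 bits).
unsigned widthOfIntrinsic(const std::string& name) {
  if (name.rfind("_mm512_", 0) == 0) return 512;
  if (name.rfind("_mm256_", 0) == 0) return 256;
  if (name.rfind("_mm_", 0) == 0)    return 128;
  return 0;  // not a vector intrinsic
}

// Raise the function's required width to cover every intrinsic it calls.
unsigned raiseRequiredWidth(unsigned required,
                            const std::vector<std::string>& calledIntrinsics) {
  for (const auto& n : calledIntrinsics)
    required = std::max(required, widthOfIntrinsic(n));
  return required;
}
```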

Then we teach the inliner how to cope with this intelligently. My guess is that this won't be trivial to do because of patterns like:

  if (DynamicTestIfTargetHasAVX512()) {
    CallFunctionWithAVX512Intrinsics();
  } else {
    CallFunctionWithLoopThatShouldBeAutoVectorizedTo256BitVectors();
  }

Here, both naive solutions may prove unfortunate: increasing the caller's required width in the IR to match the callee's and allowing the inline might in theory regress the generated code for the other function, but blocking the inlining would almost certainly regress the code.

However, I think in practice inlining will be the correct call for the vast majority of cases. Most of the really interesting cases won't be inline candidates anyway, due to different *subtargets*, which already influence inlining decisions. So generally, I think it is safe to let the inliner take the max of the required widths and proceed to inline. If we find interesting cases that require more complex logic, we can address them once we have such a test case.
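Concretely, the merge I have in mind is just this (rough sketch; the struct and field names are stand-ins, not LLVM's actual attribute representation):

```cpp
#include <algorithm>

// Sketch of the proposed inliner policy: take the max of the caller's and
// callee's required vector widths and allow the inline.
struct FuncWidth { unsigned RequiredVectorWidth; };

bool inlineAndMerge(FuncWidth& caller, const FuncWidth& callee) {
  caller.RequiredVectorWidth =
      std::max(caller.RequiredVectorWidth, callee.RequiredVectorWidth);
  return true;  // proceed with inlining in the common case
}
```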

One possible alternative is that we could make an explicit vs. implicit distinction and allow inlining to increase an implicit width but block inlining that would increase an explicitly narrow width. This might make it easier to write code like:

  if (NumberOfItemsIsVeryLarge) {
    CallFunctionWithWideVectorsButLowerFreq();
  } else {
    CallFunctionWithNarrowVectorsAndHigherFreq();
  }
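Under this alternative, the inliner's check might look something like the following (again a sketch with illustrative names, not LLVM's representation): an implicit width may be widened freely, but inlining is blocked when it would widen an explicitly narrow caller.

```cpp
#include <algorithm>

// Explicit-vs-implicit variant of the merge policy.
struct WidthAttr {
  unsigned Bits;
  bool Explicit;  // did the frontend/user pin this width explicitly?
};

bool mayInline(WidthAttr& caller, const WidthAttr& callee) {
  if (caller.Explicit && callee.Bits > caller.Bits)
    return false;  // would override an explicitly narrow caller width
  caller.Bits = std::max(caller.Bits, callee.Bits);
  caller.Explicit = caller.Explicit || callee.Explicit;
  return true;
}
```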

The critical result here is that if the user code *already contains* explicitly 512-bit width vector code, it seems reasonable for the vectorizer to also produce 512-bit width vector code. So I don't think the IR needs to model a preference and a requirement independently for the purposes of the code generator. At most, it might want this for the purpose of controlling inlining and merging.


https://reviews.llvm.org/D43306




