[PATCH] D43306: [X86] Add pass to infer required-vector-width attribute based on size of function arguments and use of intrinsics
Craig Topper via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Wed Feb 28 11:59:12 PST 2018
craig.topper added a comment.
Attempting to clarify some things.
Prior to any of my prefer vector width changes, the TTI interface informed the loop vectorizer about the maximum register width in hardware via the getRegisterBitWidth method. The loop vectorizer uses this information as a guideline, but not a strict limit of what types it can create. My understanding is that there are two modes, the default mode finds the largest scalar width of loads, stores, and phis within the loop. It then determines how many of this scalar type fit in the register width received from TTI. This determines the VF factor. The second mode, called maximize bandwidth, uses the smallest scalar width instead. I'm not sure if that's still restricted to loads, store, and phis or not.
So for both modes, if there are any larger scalar types in the loop, those types will be multiplied by the VF factor and cause the vectorizer to create types that are larger than the maximum hardware register size. These types will exist all the way to codegen, and it is up to the SelectionDAG type legalization process or target specific DAG combines to split these larger operations. In some cases the larger types are used to represent idioms that contain extends and truncates that we were able to combine to more complex instructions without doing a split. X86 probably has more pre type legalization target specific DAG combines here than most other targets. And we are doing our own splitting for type legalization in some of these combines. This prevents the type legalizer from needing to deal with target specific nodes. Though the type legalizer is capable of calling ReplaceNodeResults and LowerOperation for target specific nodes.
In addition to the vectorizer, I've believe I've seen evidence of InstCombine creating larger types when it canonicalizes getelementptrs to have pointer width types by inserting sign extends. This shows up in vector GEPs used by gather and scatter intrinsics. So if the GEP used a v16i32 index, InstCombine will turn it into v16i64. X86 tries to remove some of these extends with a DAG combine, but we're not great at finding a uniform base address for gathers which causes the extend to be hidden. So we end up spltting gathers when their result type is the maximum hardware register width, but their index is twice as large.
Based on this current behavior of IR passes with respect to hardware vector width and the fact that codegen already has to handle illegal types. So the first part of my changes(prefer-vector-width), make the vectorizer behavior similar between prefer-vector-width=256 on an avx512f target and an avx2 target by making getRegisterBitWidth return 256 in both cases. So under prefer-vector-width=256, the vectorizer will calculate the same VF factor it would for an AVX2 target, and it will still generate wider types for some things just like it does for AVX2. With just prefer-vector-width, when the IR gets to codegen we end up emitting 512 bit operations for these wider types because the type legalization process can't enforce prefer-vector-width. This is because codegen has no knowledge of the source of the vectors in the IR. They may have come from the vectorizer or they may have been present in the original source code. If they were in the original source code, they might be used by target specific intrinsics, or function parameters, or return types. In those cases we would need to ensure we emit 512 bit vectors or the intrinsics won't assemble or we could have mismatched ABI between two different translation units if one is compiled with prefer-vector-width and one isn't.
The patch presented here overcomes this limitation by providing a required-vector-width attribute that can be used to tell codegen that there is nothing in the IR that strictly requires wider vectors to avoid compilation failures or ABI mismatches. With this we can tell the type legalization process and the X86 DAG combines that only 256 bit vectors are legal. With this the type legalization process should be carried out in a very similar way to AVX2, splitting the wider types from the vectorizer in the same way. But we also gain the AVX512VL specific instructions like scatter and masked operations that aren't present in AVX2.
This patch isn't a perfect solution because it doesn't capture the source containing vectors that don't go through intrinsics or function arguments or return values. But we could emit the same function attribute earlier, in the frontend, to convey that information.
So based on my conversations with others there seem to be some questions about the baseline assumptions here.
-Should the vectorizer be creating wider vectors than what getRegisterBitWidth returns? Changing this likely means we would need to introduce splitting into the vectorizer to avoid hurting SSE/AVX/AVX2 vectorized code. But I don't think that removes the need for the backend to handle illegal types in IR so that would just be introducing another type legalization process.
-Some idioms require larger types to represent in IR, for example, average uses (trunc (add (add (zext X, zext Y), 1)) but we are able to do that in X86 with a single instruction. But the 512-bit version of that instruction requires the zexts be wider than 512 bits in IR.
-Should InstCombine be creating wide illegal vector types to canonicalize GEPs? Are there other places that do similar things? Other passes?
-Is X86 too aggressive about creating target specific nodes early requiring many places to do custom type legalization? The checks are getting centralized with SplitBinaryOpsAndApply so hopefully we won't have to think about all the different split sizes in multiple places going forward.
What I'm looking for is a way to guarantee that with prefer-vector-width=256, starting from scalar code and vectorizing it that we won't end up with 512 registers in the output. Even a few of them are enough to trigger than frequency reduction penalty I am trying to avoid. Given the current behavior of IR passes and their expectations of what codegen can handle, I don't see a way to make this guarantee from IR. And it would be easy to break in the future as new passes or passes are modified. The codegen type legalization process seems the best place to handle this. But we need to be able to communicate when it is safe to do so.
Would this patch be more acceptable in the near term if I do one of the following:
-Get ride of the required-vector-width attribute and have this pass force a prefer-vector-width attribute to be at least as large as any ABI or intrinsic constraints? Then make the X86 codegen use prefer-vector-width to control the type legalizer. This would mean some places in x86 codegen would create new 512-bit operations that weren't in the IR since we'd be overriding the preference.
-Rename the required-vector-width attribute to x86-required-vector-width so that it is clear it is x86 specific for now. And we could only given a generic name if we decide to emit it earlier than x86 codegen.
Either of these would allow people to start experimenting with prefer-vector-width=256. We would need to revisit and make changes before we could make prefer-vector-width=256 the default for SKX.
More information about the llvm-commits