[LLVMdev] Generalizing shuffle vector
Mon Ping Wang
monping at apple.com
Tue Sep 30 16:37:54 PDT 2008
Hi Eli,
I understand your concern and your question is a good one. In this
specific example, you are right about being more clever in
DAGCombiner. I did start to go down this road and implemented
something similar to what you suggested and it does work reasonably
well for that specific example. There are other cases though that
this approach doesn't work well where by generating insert/extracts,
we lose the structure of the vector program. Let me see if I can add
some more motivation in favor of this change. Consider the following
case where we are doing a transpose by dividing a 16 wide vector into
four 4 wide vectors.
typedef __attribute__(( ext_vector_type(4) )) float float4;
typedef __attribute__(( ext_vector_type(16) )) float float16;
float16 f16;
float4 f4a, f4b, f4c, f4d, t;
f4a = f16.hi.hi; f4b = f16.hi.lo; f4c = f16.lo.hi; f4d = f16.hi.hi;
f4a = __builtin_ia32_unpcklps( f4a, f4c );
t = __builtin_ia32_unpckhps( t, f4c );
f4c = f4b;
f4a = __builtin_ia32_unpcklps( f4a, f4d );
f4b = __builtin_ia32_unpckhps( f4b, f4d );
f16.lo.lo = __builtin_ia32_unpcklps( f4a, f4b );
f16.lo.hi = __builtin_ia32_unpckhps( f4a, f4b );
f16.hi.lo = __builtin_ia32_unpcklps( t, f4c );
f16.hi.hi = __builtin_ia32_unpckhps( t, f4c );
All this code is that if you have <0, 1, 2, 3, 4..., 15>, you will get
<0, 4, 8, 12, 1, 5, 9, 13, 2 ..., 15>. Each of builtins maps to a
vector shuffle. After opt is run, we end up with a extract of each
element and an insert into the correct location. It is difficult for
the code generator to rebuild the correct series of shuffles as it
requires that doing shuffles for one float4 of the result, you are
also doing some work for another float4. It is possible to retain the
structure of the vectors by preventing opt from optimizing away any
shuffle operations but we might miss some optimizations opportunities
when the code is really extracting a few elements and using it in a
scalar context.
From a different perspective, the insert and extracts are breaking up
vectors by scalarizing it elements and then in code generation, we are
rebuilding the vectors again and their operation. Assuming that the
vector code is well written, it will be advantageous for us to retain
the structure of the incoming program and optimize and change it when
we know we can do better than that. With insert and extract, it is
easier for us to lose the structure of the incoming vector as we
optimize it. It just seems cleaner to me that when we manipulate
vectors that they can remain vectors.
-- Mon Ping
On Sep 30, 2008, at 1:24 PM, Eli Friedman wrote:
> On Mon, Sep 29, 2008 at 8:11 PM, Mon Ping Wang <wangmp at apple.com>
> wrote:
>> The problem with generating insert and extracts is that we can
>> generate poor
>> code
>> %tmp16 = extractelement <4 x float> %f4b, i32 0
>> %f8a = insertelement <8 x float> %f8a, float %tmp16, i32 0
>> %tmp18 = extractelement <4 x float> %f4b, i32 1
>> %f8c = insertelement <8 x float> %f8b, float %tmp18, i32 1
>> ...
>> For X86, legalize will convert each insertelement to become a vector
>> shuffle. We are very careful in combining vector shuffles because
>> we don't
>> want to produce a vector shuffle whose mask is illegal or hard to
>> code gen
>> so we end up in this code to generate a sequence of unpcks and
>> movhlps for
>> this. With the new form, Legalize will divide the 8xf32 vector
>> into two
>> 4xf32 and since the two sides are the same, it will generate quad
>> word moves
>> to copy the values.
>
> I think this specific issue can be fixed without extending the
> IL-level syntax; DAGCombiner could easily be made a lot more clever
> about cases like this. For example, before legalization, we can
> transform an INSERT_VECTOR_ELT inserting an element into a constant
> vector or a SCALAR_TO_VECTOR into a BUILD_VECTOR, and we can transform
> BUILD_VECTOR into CONCAT_VECTORS or EXTRACT_SUBVECTOR for relevant
> cases. We can also make the lowering significantly more clever about
> dealing with insertelement.
>
> If what we currently have isn't sufficient, I think the first step is
> to extend VECTOR_SHUFFLE to be more flexible, and implement the
> relevant combines to make that effective. I don't think it makes
> sense to add this extension to the IL before it's proven that we can
> do CodeGen on it reasonably.
>
> I'm not necessarily against this extension, but since we're talking
> about constructs which can already represented in the IL without
> losing too much semantic information, we really want to be going about
> this from the bottom up. It's going to be hard to properly see the
> concrete benefits without a solid baseline to compare against.
>
> -Eli
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
More information about the llvm-dev
mailing list