[PATCH][X86] AVX512: Add vbroadcasti*

Thu Jun 26 17:53:33 PDT 2014

OK, I poked around more and I think I know what I was missing.  Surprisingly (at least to me), X86VBroadcast is defined not only for scalar but also for vector input types.  E.g. for AVX:

// scalar:
  def : Pat<(v4f64 (X86VBroadcast FR64:$src)),
          (VBROADCASTSDYrr (COPY_TO_REGCLASS FR64:$src, VR128))>;
// vector:
  def : Pat<(v4f64 (X86VBroadcast (v2f64 VR128:$src))),
          (VBROADCASTSDYrr VR128:$src)>;

I would have expected that its input would have to be a scalar type (occupying a vector register, naturally).
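
To make the vector-input case concrete, here is a minimal C++ sketch of what the register form appears to compute: only element 0 of the source is read and copied into every destination lane.  This is a hypothetical model, not LLVM code; the name broadcastFirstElement and the use of std::vector are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model (not LLVM code): the register form of X86VBroadcast with
// a vector input is assumed to read only element 0 of the source and to
// replicate it into every lane of the destination.
template <typename T>
std::vector<T> broadcastFirstElement(const std::vector<T> &src,
                                     std::size_t dstElts) {
  assert(!src.empty() && "source vector must have at least one element");
  // Every destination lane receives a copy of src[0]; other lanes are ignored.
  return std::vector<T>(dstElts, src[0]);
}
```

Under this model, a v2f64 source and a scalar f64 sitting in the low lane of a vector register give the same v4f64 result, which would explain why both patterns can select VBROADCASTSDYrr.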

For the memory variants, on the other hand, the input is necessarily scalar, since there the input type also determines the memory access type.  E.g.:

def VBROADCASTSDYrm {
…
  list<dag> Pattern = [(set VR256:$dst, (v4f64 (X86VBroadcast (loadf64 addr:$src))))];

Ideally, we would want to differentiate scalar vs. sub-vector/tuple broadcasts (VBROADCAST[IF]*) by disambiguating on the input type alone.  We can probably still do it, because sub-vector broadcast is only supported on memory operands, but it feels less clean.  E.g.:

(v8i64 (X86VBroadcast (v4i64 (load addr:$src)))) is a sub-vector broadcast, vs.
(v8i64 (X86VBroadcast (v4i64 VR256:$src))) is a scalar broadcast of the first element, and
(v8i64 (X86VBroadcast (i64 (load addr:$src)))) is a scalar broadcast as well, with a memory operand.
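
The rule implied by these three cases can be sketched as a small C++ classifier.  This is only an illustration of the disambiguation, not code from the patch; BroadcastInput, BroadcastKind and classifyBroadcast are hypothetical names, and the sketch assumes the operand is fully described by its element count and by whether it is a load.

```cpp
#include <cassert>

enum class BroadcastKind { Scalar, SubVector };

// Hypothetical description of the operand of an X86VBroadcast node.
struct BroadcastInput {
  unsigned numElts; // 1 for a scalar such as i64, 4 for v4i64
  bool fromLoad;    // true if the operand is a (load addr:$src) node
};

BroadcastKind classifyBroadcast(const BroadcastInput &in) {
  assert(in.numElts >= 1 && "input must have at least one element");
  // Only a vector operand that comes straight from memory is treated as a
  // sub-vector broadcast.
  if (in.numElts > 1 && in.fromLoad)
    return BroadcastKind::SubVector;
  // Scalar loads and vector registers both broadcast a single element.
  return BroadcastKind::Scalar;
}
```

Note that the type alone is not enough here: the v4i64 operand lands in different categories depending on whether it is a load, which is the part that feels less clean.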

Nadav, Jim, do you guys have any opinion on this?

To keep this moving while we have the discussion, I checked in the version Elena suggested.  It’s r211828.

Thanks,
Adam

On Jun 26, 2014, at 6:38 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:

> 
> I just suggest separating the broadcast of one element from the broadcast of a sub-vector.
> If you are not masking, the same broadcast instruction may be used for multiple types:
> 
> (v8i64 (X86SubVectorBroadcast (v4i64 (load ..)))) - VBROADCASTI64X4
> (v16i32 (X86SubVectorBroadcast (v8i32 (load ..)))) - VBROADCASTI64X4
> 
> 
> multiclass avx512_int_subvec_broadcast_rm<bits<8> opc, string OpcodeStr,
>                          X86MemOperand x86memop, PatFrag ld_frag,
>                          RegisterClass DstRC, ValueType OpVT, ValueType SrcVT,
>                          RegisterClass KRC> {
>  let mayLoad = 1 in {
>  def rm : AVX5128I<opc, MRMSrcMem, (outs DstRC:$dst), (ins x86memop:$src),
>                  !strconcat(OpcodeStr, " \t{$src, $dst|$dst, $src}"),
>                  [(set DstRC:$dst, 
>                    (OpVT (X86SubVectorBroadcast (ld_frag addr:$src))))]>, EVEX;
>  def krm : AVX5128I<opc, MRMSrcMem, (outs DstRC:$dst), (ins KRC:$mask,
>                                                         x86memop:$src),
>                  !strconcat(OpcodeStr, 
>                      " \t{$src, ${dst} {${mask}} {z}|${dst} {${mask}} {z}, $src}"),
>                  [(set DstRC:$dst, (OpVT (X86SubVectorBroadcast KRC:$mask, 
>                                     (ld_frag addr:$src))))]>, EVEX, EVEX_KZ;
>  }
> }
> 
> -  Elena
> 
> 
> -----Original Message-----
> From: Adam Nemet [mailto:anemet at apple.com] 
> Sent: Wednesday, June 25, 2014 21:40
> To: Demikhovsky, Elena
> Cc: llvm-commits
> Subject: Re: [PATCH][X86] AVX512: Add vbroadcasti*
> 
> Hi Elena,
> 
> On Jun 25, 2014, at 12:35 AM, Demikhovsky, Elena <elena.demikhovsky at intel.com> wrote:
> 
>> Hi Adam,
>> 
>> I don't think that we should mix instructions like VBROADCASTI32X4 with the regular, one-element broadcast.
>> 
>> I suggest writing one more template specifically for these 2 instructions. It may be used for inserting a sub-vector, where the sub-vector is loaded from memory.
>> What I mean is that the lowering model of these 2 instructions is different. You can leave the substitution pattern empty.
> 
> Can you please elaborate?  In this patch I just applied my conclusions from my AVX2 broadcast work (http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20140526/106526.html and the first patch in http://lists.cs.uiuc.edu/pipermail/llvm-commits/Week-of-Mon-20140526/219238.html).
> 
> To summarize, I was suggesting using the same lowering for all broadcasts there, without using LLVM intrinsics.  At least that is the plan I outlined there for VBROADCAST[IF]128, which corresponds to these instructions.  The main benefit is that exposing the memory loads rather than hiding them in an intrinsic opens them up to optimizations (e.g. LICM).
> 
> The idea is to use generic nodes to implement these front-end builtins, then combine them into X86VBroadcast SDNodes.  In the case of vbroadcasti64x4 this would amount to: (v8i64 (X86VBroadcast (v4i64 (load ...)))).
> 
> Do you disagree with this direction for codegen?
> 
> I can of course keep them separate for now until I have prototyped either AVX2's vbroadcasti128 or the AVX512 counterparts.  What I want to make sure of at this point is that we're on the same page moving toward more of the codegen work.
> 
> Thanks,
> Adam
> 
>> -  Elena
>> 
>> -----Original Message-----
>> From: Adam Nemet [mailto:anemet at apple.com]
>> Sent: Wednesday, June 25, 2014 00:07
>> To: Demikhovsky, Elena
>> Cc: llvm-commits
>> Subject: [PATCH][X86] AVX512: Add vbroadcasti*
>> 
>> Hi Elena,
>> 
>> I have modified the avx512_int_broadcast_rm multiclass to fit these instructions as well.
>> 
>> The individual patches provide a bit more information.  Please let me know if it looks good to you.
>> 
>> Thanks,
>> Adam
>> 
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>> 
>> This e-mail and any attachments may contain confidential material for 
>> the sole use of the intended recipient(s). Any review or distribution 
>> by others is strictly prohibited. If you are not the intended 
>> recipient, please contact the sender and delete all copies.
>> 
> 