[PATCH] D32530: [SVE][IR] Scalable Vector IR Type

Thu May 2 09:35:48 PDT 2019

huntergr added a comment.

In D32530#1488001 <https://reviews.llvm.org/D32530#1488001>, @cameron.mcinally wrote:

> > If you know the exact size of your vectors at compile time, then I believe fixed length vector types should be used, at least at the IR level.
>
> Ah, ok. So let's say we're targeting SVE and I know my vector length is 512b. I was imagining building a vector like this:
>
>   %r1 = insertelement <scalable 2 x double> undef, double %x, i32 0
>   %r2 = insertelement <scalable 2 x double> %r1, double %x1, i32 1
>   %r3 = insertelement <scalable 2 x double> %r2, double %x2, i32 2
>   %r4 = insertelement <scalable 2 x double> %r3, double %x3, i32 3
>   %r5 = insertelement <scalable 2 x double> %r4, double %x4, i32 4
>   %r6 = insertelement <scalable 2 x double> %r5, double %x5, i32 5
>   %r7 = insertelement <scalable 2 x double> %r6, double %x6, i32 6
>   %r8 = insertelement <scalable 2 x double> %r7, double %x7, i32 7
>

You could do that (judging by the rest of the discussion), but you'd effectively get poison values for any index beyond the runtime vector length. I guess if you're treating scalable vectors as pseudo-fixed-length, then you don't care about running the same binary on hardware with a different vector length.
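
To make the out-of-range point concrete, here is a minimal sketch in Python. It is a model only, not LLVM code. It assumes the runtime lane count of `<scalable 2 x double>` is 2 times the number of 128b granules in the hardware vector, and the helper names `runtime_lanes` and `in_range_indices` are made up for illustration.

```python
# Model of which insertelement indices are in range for <scalable 2 x double>.
# Assumption: the runtime lane count is min_lanes * (vector_bits / 128).
# The helper names here are hypothetical, not part of any LLVM API.

GRANULE_BITS = 128  # SVE vector lengths are multiples of 128 bits


def runtime_lanes(min_lanes: int, vector_bits: int) -> int:
    """Number of lanes a <scalable min_lanes x T> vector has at runtime."""
    if vector_bits <= 0 or vector_bits % GRANULE_BITS != 0:
        raise ValueError("vector length must be a positive multiple of 128 bits")
    return min_lanes * (vector_bits // GRANULE_BITS)


def in_range_indices(indices, min_lanes: int, vector_bits: int):
    """Split indices into those within the runtime length and those beyond it."""
    lanes = runtime_lanes(min_lanes, vector_bits)
    ok = [i for i in indices if i < lanes]
    beyond = [i for i in indices if i >= lanes]
    return ok, beyond


if __name__ == "__main__":
    # The quoted IR uses indices 0..7; only 512b (or wider) hardware covers them all.
    for bits in (128, 256, 512):
        ok, beyond = in_range_indices(range(8), min_lanes=2, vector_bits=bits)
        print(f"{bits}b: in range {ok}, beyond runtime length {beyond}")
```

Under this model, all eight inserts from the quoted IR are in range only when the hardware vector is at least 512b. On narrower hardware the higher indices fall past the end of the vector.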

> But it sounds like I should rather just build a <8 x double> vector instead. Would that <8 x double> vector legalize to SVE vector instructions? Or would it be split into four 128b NEON vectors?

In the current code, you'd get 4 NEON vectors. In the future we'd implement fixed-length SVE support as well (for SLP autovec without introducing extra predicate generation/branching), but the recommended method of loop autovec for SVE would still be vector-length agnostic (VLA). Once that support exists you'd be able to use fixed-length types for your own work, but we're still figuring out the design (to minimize the number of ISel patterns).
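
The count of 4 is just the arithmetic of splitting 8 x 64b across 128b registers. A small Python sketch of that calculation follows. It is a simplification, not the legalizer's actual logic, and only handles vectors whose total width is a multiple of 128b; `neon_parts` is a hypothetical name.

```python
# Arithmetic sketch: how many 128-bit NEON registers a fixed-length vector spans.
# Assumption: total width is a multiple of 128 bits; real type legalization has
# more cases (widening, promotion) that this deliberately ignores.

NEON_BITS = 128  # width of one NEON Q register


def neon_parts(num_elements: int, element_bits: int) -> int:
    """Number of 128-bit pieces needed to hold num_elements x element_bits."""
    total_bits = num_elements * element_bits
    if total_bits <= 0 or total_bits % NEON_BITS != 0:
        raise ValueError("this sketch only covers positive multiples of 128 bits")
    return total_bits // NEON_BITS


if __name__ == "__main__":
    # <8 x double>: 8 * 64 = 512 bits, split into 128-bit pieces
    print(neon_parts(8, 64))
```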

>> The performance of insert/extract is target dependent. For SVE you almost always need a predicate for a single element insert or extract, so you're not going to gain anything by knowing the size ahead of time.
> 
> So back to my scalable vector example above, it sounds like even if the above IR was valid, we'd still have to produce predicates at the hardware instruction level. So maybe my concern is moot...

Yes, though if the index is known to be within a certain range (that of a 5-bit signed immediate), you can skip generating a splat and just compare against the stepvector directly; so for your 512b example CPU, you'd be able to generate one fewer instruction when indexing 32b or 64b elements. If you're building up an entire vector (as in your IR), the `insr` instruction will shift all elements along by one and insert into the first lane, so no predicates would be needed. We would need to pattern match and optimize for this case, though.
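
Both points can be sketched with a small Python model. This is illustrative only, not SVE or LLVM code. It assumes a 5-bit signed immediate covers -16..15 and models `insr` as "shift every lane up by one, write the scalar into lane 0". The function names are hypothetical.

```python
# Model of (1) whether every lane index fits a 5-bit signed immediate and
# (2) building a whole vector with repeated insr-style shift-and-insert steps.
# All helper names are made up for this sketch.

IMM5_MIN, IMM5_MAX = -16, 15  # range of a 5-bit signed immediate


def all_indices_fit_imm5(vector_bits: int, element_bits: int) -> bool:
    """True if the highest lane index for this vector/element size is <= 15."""
    lanes = vector_bits // element_bits
    return lanes >= 1 and (lanes - 1) <= IMM5_MAX


def insr(vec, scalar):
    """Shift all elements up one lane (dropping the top one), insert at lane 0."""
    if not vec:
        raise ValueError("vector must have at least one lane")
    return [scalar] + vec[:-1]


def build_with_insr(values, lanes: int):
    """Build a vector by inserting values last-to-first, so values[0] ends in lane 0."""
    vec = [None] * lanes  # None stands in for undef lanes
    for v in reversed(values):
        vec = insr(vec, v)
    return vec


if __name__ == "__main__":
    for elt in (8, 16, 32, 64):
        print(f"512b vector, {elt}b elements: fits imm5 = {all_indices_fit_imm5(512, elt)}")
    print(build_with_insr(["x", "x1", "x2", "x3", "x4", "x5", "x6", "x7"], lanes=8))
```

In this model a 512b vector has 16 lanes of 32b or 8 lanes of 64b, so every index fits the immediate; 16b and 8b elements give 32 and 64 lanes, which do not. Note that the values have to be fed to the `insr` steps in reverse order for the first one to end up in lane 0.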

https://reviews.llvm.org/D32530