<br><br>On Tuesday, February 5, 2019, Simon Moll <<a href="mailto:moll@cs.uni-saarland.de">moll@cs.uni-saarland.de</a>> wrote:<div><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

I think this is the usual mixup of AVL and MVL.<br>

<br>

AVL: is part of the predicate</blockquote><div><br></div><div>Mmm, it's very confusing to say that AVL is part of the predicate. It's... kinda true?</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><br></blockquote><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> and can change between vector operations just like a mask can (light weight).</blockquote><div><br></div><div>Yes, OK, it's more that AVL is "advisory". In RVV the program (the instruction) *requests* a specific AVL and the processor responds with an *actual* AVL between 0 (yes really, zero) and MIN(MVL, requested_AVL).</div><div><br></div><div>To say that it's a predicate, well... with a predicate mask, you set it and the mask is obeyed, period. With AVL, that just doesn't happen.</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

MVL: Is the physical vector register length and can be re-configured per function (RVV only atm) - (heavy weight, stop-the-world instruction).</blockquote><div><br></div><div>My understanding of RVV is that MVL is intended to be more of a hardcoded parameter that is part of the processor design. Any compiler should be generating code that really does not need to know what MVL is.</div><div><br></div><div>SV is slightly different, because we use the *scalar* regfile as if it were a typecast SRAM. The register number in any given instruction is just a pointer to the SRAM address at which vector elements i8/16/32/64 are read/written.</div><div><br></div><div>So in SV we need to *set* the MVL, otherwise how can the engine know the point where it has to stop reading/writing to the register SRAM?</div><div><br></div><div>However, what is most likely to happen is that MVL will be set globally to e.g. 4 and be done with it.</div><div><br></div><div>SV semantics for AVL are also slightly different from RVV, though not by much. The engine is not permitted to choose arbitrary values: if AVL is requested to be set to 4, it must *be* set to MIN(MVL, 4). This can sometimes avoid the need for a loop entirely (short vectors).</div><div><br></div><div>Note also that in SV, neither AVL nor MVL may be set to zero. AVL=1 indicates that the engine is to interpret instructions in SCALAR mode.</div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

The vectorlen parameter in EVL intrinsics is for the AVL.<br>

<br></blockquote><div><br></div><div>OK, so there is a bit of a problem, for both SV and RVV, in that both can end up with different AVL values from what is requested.</div><div><br></div><div>If the API expects that, when AVL elements are to be processed, exactly that number of elements *will* have been processed, that is simply not the case, and that assumption will result in a catastrophic failure: elements not being processed.</div><div><br></div><div>To deal with that, if it is a hard requirement of the API that exactly the requested number of AVL ops are carried out, an otherwise completely redundant assembly code for-loop will have to be generated.</div><div><br></div><div>Oh, and then outside of that loop would be the IR level inner loop that was actually part of the user's program.</div><div><br></div><div>Basically what I am saying is that the semantics "request an AVL from the hardware and get an ACTUAL number of elements to be processed" really need to become part of the API.</div><div><br></div><div>Now, fascinatingly, for SIMD-only style architectures, that could hypothetically be used to tell the JIT engine converting the IR to use progressively smaller SIMD widths, on architectures that have multiple widths. Also to indicate when corner-case cleanup is to be used. (SIMD already being a mess, this would all not be high priority / optimised)</div><div> </div><div>OR...</div><div><br></div><div>the inner workings of AVL are entirely hidden and opaque to the IR. The IR sets the total explicit number of elements, and It Gets Done.</div><div><br></div><div>However I suspect that doing that will open a can o' worms.</div><div><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<br>

I'm curious what SVE will do if there is an if/then/else in the middle<br>

of a vectorised loop with a shorter-than-maximum vector length. You<br>

can't just invert the mask when going from the then-part to the<br>

else-part because that would re-enable elements past the end of the<br>

vector. You'd need to invert the mask and then AND it with the mask<br>

containing the (bitwise representation of) the vector length.<br>

</blockquote></blockquote></blockquote><div><br></div><div>Yep, that is a workable solution for fixed width (SIMD) architectures; it is a good pattern to use.</div><div><br></div><div>As I mentioned earlier (about the mistake of using gather/scatter as a means and method of implementing predication), it would be a mistake to try to "dumb down" this proposal to cater for fixed-length SIMD engines to the detriment of dynamic-length engines.</div><div><br></div><div>If you try that, then all the advantages of dynamic-length ISAs are utterly destroyed, as the only way to comply with a dumbed-down fixed-length proposal is for variable-length ISAs to issue brain-dead FIXED length assembly code.</div><div><br></div><div>Whereas if the API can cope with variable length, the length that is returned for a SIMD engine may be one of the multiples of SIMD widths that that engine supports, and the engine can use scatter/gather as a substitute for (potential) lack of predication masks, and so on.</div><div><br></div><div>If as an industry we want to break free of the seductively broken SIMD paradigm, then variable-length engines need to be given top priority.</div><div><br></div><div>Really. And again, I say that with profuse apologies to all engineers who have to deal with SIMD. I know it's so much easier to implement at the hardware level, it's just that SIMD has always made the compiler writer's job absolute hell.</div><div><br></div><div>L.</div><div><br></div><div> </div></div><br><br>-- <br>---<br>crowd-funded eco-conscious hardware: <a href="https://www.crowdsupply.com/eoma68" target="_blank">https://www.crowdsupply.com/eoma68</a><br><br>