<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hi,<div class=""><br class=""></div><div class="">Sorry for jumping in a bit late, I missed the initial discussion.<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On Jul 21, 2020, at 05:13, Baptiste Saleil via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org" class="">cfe-dev@lists.llvm.org</a>> wrote:</div><div class=""><div dir="ltr" style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 22 Jun 2020 at 19:01, Hal Finkel <<a href="mailto:hfinkel@anl.gov" class="">hfinkel@anl.gov</a>> wrote:</div><blockquote class="gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left-width: 1px; border-left-style: solid; border-left-color: rgb(204, 204, 204); padding-left: 1ex;"><div class=""><div class="">On 6/19/20 9:31 PM, Baptiste Saleil via cfe-dev wrote:</div><blockquote type="cite" class=""><div dir="ltr" class="">The new instructions manipulate matrices that the CPU represents by new 512-bit<br class="">registers called `accumulators`. Copying matrices, modifying values and<br class="">extracting values of matrices may cause the CPU to copy values from/to the<br class="">matrix multiplication unit. To avoid degrading performance, we thus want to<br class="">minimize the number of times these operations are used. So the user will be able<br class="">to modify and extract values of the matrices and perform computations with them<br class="">by using the dedicated builtins only. 
The instructions are designed to be used in<br class="">computational kernels and we want to enforce that specific workflow.<br class=""><br class="">Because of this restriction, we cannot rely on the target-independent matrix<br class="">types [1].</div></blockquote><p class=""><br class=""></p><p class="">If this is part of the documented system ABI, and what will be supported by GCC, then we should support it too.</p><p class="">That having been said, I'm not convinced that this is a good idea, and supporting the target-independent matrix types would be better. I understand that the copying will be expensive, and is something that should be avoided, but this is true to some extent for everything: there are some usages that compile to machine code efficiently and some that don't. We generally, however, favor the ability to create abstractions that *can* be compiled efficiently as part of expected use cases, even if we cannot guarantee that all uses will produce efficient code. In his case, you're prohibiting the creation of abstractions (by semantically restricting to local variables) because you fear that not all uses will compile to efficient code. Are there some other structural reasons why supporting these are regular values would be problematic?<br class=""></p></div></blockquote>Supporting these as regular values would be problematic for several reasons. These new accumulator registers are actually each associated with 4 of the existing 128-bit VSR vector registers. A particularity of MMA is that when an accumulator contains defined data, its 4 associated registers contain undefined data and cannot be used. When copying an accumulator, we need to:<br class=""> 1. Copy its value back to its four associated VSRs<br class=""> 2. Copy these 4 VSRs to the VSRs associated with the destination accumulator<br class=""> 3. Copy these VSRs to the destination accumulator<span class="Apple-converted-space"> </span><br class=""> 4. 
If the copy is not a kill, copy the 4 VSRs associated with the source back to the source accumulator<br class=""><br class="">So if these registers were supported as regular values, we would have really expensive copy (and also expensive function calls and returns) and we would prevent from using 4 vector registers per live accumulator. More importantly (something I should have mentioned in the RFC), the new instructions actually implement a single operation that is the outer product. That means that supporting these as regular values would imply copying accumulators back to their associated VSRs and generating non-MMA instructions for any other operation anyway. Therefore, it is likely that programs using matrices would actually be less efficient.<br class=""><br class=""></div></div></div></blockquote><div><br class=""></div><div>From the reasoning above, it sounds like there are no structural reasons that prevent using the matrix type extension, unless I am missing something.</div><div><br class=""></div>But if I understand correctly, the main motivation for introducing the new types with the additional restrictions is to prevent users from writing code that cannot be mapped directly to the hardware?</div><div><br class=""></div><div>In particular, is a concern with the matrix types extension that a user could specify a matrix operation that cannot be mapped directly to the MMA extension, e.g. 
a multiplication of 13 x 7 float matrices?</div><div>And that a user could specify costly accesses, for example repeated accesses to elements that live in different VSRs?</div><div><br class=""></div><div><blockquote type="cite" class=""><div class=""><div dir="ltr" style="caret-color: rgb(0, 0, 0); font-family: Helvetica; font-size: 12px; font-style: normal; font-variant-caps: normal; font-weight: normal; letter-spacing: normal; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration: none;" class=""><div class="gmail_quote"><div class="">However, although we're not planning on supporting the target-independent matrix types for these reasons, we're not excluding supporting the target-independent matrix operations. We are exploring implementing the target-independent matrix multiplication operation with MMA kernels. That way, on PowerPC, programs using target-independent matrix types and operations would actually benefit from MMA for matrix multiplication with no additional effort.</div></div></div></div></blockquote></div><br class=""></div><div class="">We recently started working upstream on infrastructure to allow for target-specific lowering. Any collaboration on that front would be very welcome, to make sure things are general enough to support many different hardware extensions.</div><div class=""><br class=""></div><div class="">Cheers,</div><div class="">Florian</div></body></html>