[llvm-dev] Do middle-end passes need to consider special types when optimizing? Or should the back-end revert the optimization accordingly?

James Y Knight via llvm-dev llvm-dev at lists.llvm.org
Thu Mar 18 18:27:36 PDT 2021


Why is that harder than lowering a `load <256 x i32>` followed by a bitcast to
x86_amx?


E.g., I see there is a transform in llvm/lib/Target/X86/X86LowerAMXType.cpp:

%src = load <256 x i32>, <256 x i32>* %addr, align 64
%2 = bitcast <256 x i32> %src to x86_amx
-->
%2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %addr, i64 %stride64)


Isn't it equivalent, then, to do:

%2 = load x86_amx, x86_amx* %addr, align 64
-->
%2 = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %addr, i64 %stride64)



On Thu, Mar 18, 2021 at 9:29 AM Luo, Yuanke <yuanke.luo at intel.com> wrote:

> But x86_amx represents a tile. The semantics of the hardware instruction
> tileloadd are something like ‘llvm.matrix.row.major.load’. How do we lower
> `%v = load x86_amx, x86_amx* %ptr` to tileloadd?
>
>
>
> *From:* James Y Knight <jyknight at google.com>
> *Sent:* Thursday, March 18, 2021 9:09 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>
> *Cc:* Florian Hahn <florian_hahn at apple.com>; Wang, Pengfei <
> pengfei.wang at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] Do middle-end passes need to consider special
> types when optimizing? Or should the back-end revert the optimization
> accordingly?
>
>
>
> Since the x86_amx type has a fixed size of 1024 bytes, I would expect `%v =
> load x86_amx, x86_amx* %ptr` to load 1024 bytes of contiguous memory
> starting at %ptr -- I don't see why this should be invalid?
>
>
>
> On Thu, Mar 18, 2021 at 8:53 AM Luo, Yuanke <yuanke.luo at intel.com> wrote:
>
> I mean that transforming from “load <256 x i32>*” to “load x86_amx*” is
> not valid, because x86_amx represents a tile and “load x86_amx*” doesn’t
> express its semantics without a stride. So “load x86_amx*” now looks
> invalid to me.
>
>
>
> *From:* James Y Knight <jyknight at google.com>
> *Sent:* Thursday, March 18, 2021 8:41 PM
> *To:* Luo, Yuanke <yuanke.luo at intel.com>
> *Cc:* Florian Hahn <florian_hahn at apple.com>; Wang, Pengfei <
> pengfei.wang at intel.com>; llvm-dev <llvm-dev at lists.llvm.org>
> *Subject:* Re: [llvm-dev] Do middle-end passes need to consider special
> types when optimizing? Or should the back-end revert the optimization
> accordingly?
>
>
>
> Err...are you saying this is the expected semantics of a "load x86_amx"
> operation today? That doesn't make much sense...Surely a strided-load
> operation should be spelled `llvm.matrix.column.major.load` in the IR, not
> `load`?
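>
> For a 16x16 tile of i32 elements, a sketch of that spelling (the intrinsic
> is real; the exact name mangling and argument values here are illustrative):
>
> %v = call <256 x i32> @llvm.matrix.column.major.load.v256i32.i64(i32* %ptr, i64 %stride, i1 false, i32 16, i32 16)
>
> which carries the stride explicitly instead of hiding it in the pointer type.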
>
>
>
> On Thu, Mar 18, 2021 at 8:17 AM Luo, Yuanke via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> Thanks, Florian. I agree with you that pointers to `x86_amx` have different
> semantics than regular LLVM pointer types. First, an x86_amx pointer points
> to a 2D tile of a big matrix: the data within each row is contiguous, but
> consecutive rows are not contiguous in memory. The picture below shows the
> x86_amx load semantics; we need another operand, the stride, to describe
> the distance between rows. So the semantics of “load <256 x i32>*” and
> “load x86_amx*” are different, because “load <256 x i32>*” assumes the
> memory is contiguous and loads a flat vector.
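>
> As a sketch, using the internal intrinsic from X86LowerAMXType.cpp (%row /
> %col give the tile shape and %stride the byte distance between rows):
>
> %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %base, i64 %stride)
>
> Row i is read starting at %base + i * %stride, so a flat “load <256 x
> i32>” only matches the special case where the stride equals the row size.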
>
> You also mentioned that there is no documentation for x86_amx in the
> LangRef. I’d like to add x86_amx to the document. Is there any process for
> documenting a type?
>
> [Inline image scrubbed by the archive (image003.jpg): a diagram of the
> x86_amx load, showing rows that are contiguous internally but separated
> by a stride in memory.]
>
> Thanks
>
> Yuanke
>
>
>
> *From:* Florian Hahn <florian_hahn at apple.com>
> *Sent:* Thursday, March 18, 2021 6:03 PM
> *To:* Wang, Pengfei <pengfei.wang at intel.com>
> *Cc:* llvm-dev <llvm-dev at lists.llvm.org>; Luo, Yuanke <
> yuanke.luo at intel.com>
> *Subject:* Re: [llvm-dev] Do middle-end passes need to consider special
> types when optimizing? Or should the back-end revert the optimization
> accordingly?
>
>
>
>
>
>
>
> On Mar 17, 2021, at 10:11, Wang, Pengfei via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>
>
> Hi,
>
>
>
> We are developing prototypes for the Intel Advanced Matrix Extensions (AMX)
> [1] programming model in Clang and LLVM [2].
>
> We have met several cases where the special type we added gets optimized
> unexpectedly in the middle-end, e.g. the phi + bitcast + load combine:
>
>
>
> From
>
> %a = load <256 x i32>, <256 x i32>* %mem, align 64
>
> … …
>
> %b = phi <256 x i32> [ %a, %label1 ], [%someother, %label2]
>
> %c = bitcast <256 x i32> %b to x86_amx
>
> To
>
> %a = bitcast <256 x i32>* %mem to x86_amx*
>
> %b = load x86_amx, x86_amx* %a, align 64
>
> … …
>
> %c = phi x86_amx [ %b, %label1 ], [%someother, %label2]
>
>
>
> To prevent such unexpected transforms, we added explicit type checks at
> each of these optimization points.
>
> Roman pointed out that the changes are not the right direction [3] and
> considers this a backend bug. While we agree the backend might be able to
> handle it for correctness, we think it is better to handle it in the
> middle-end, since these are negative optimizations for AMX.
>
>
>
> First, let me put some background here:
>
>    1. x86_amx* is different from trivial pointers.
>
> The AMX load instruction is quite different from other load instructions:
> it needs not only the memory address but also the shape / stride of the
> tile register. We did some extra work in the backend to deduce the shape
> information from the context. We don't want passes to add new x86_amx
> related uses, because that makes the deduction difficult. That is to say,
> bitcasting other pointer types to x86_amx* is not as trivial as assumed
> here.
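>
> For example (a sketch of the deduction): the shape of a tile value can be
> read off its defining intrinsic,
>
> %t = call x86_amx @llvm.x86.tileloadd64.internal(i16 %row, i16 %col, i8* %p, i64 %s)  ; shape (%row, %col) is explicit here
>
> whereas a plain “load x86_amx” introduced by an optimization carries no
> shape information, so the backend has nothing to deduce from.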
>
>
>
> The problem appears to be that this difference is not modeled or specified
> in LLVM IR AFAICT. The current LangRef does not appear to specify
> `x86_amx` to start with. If pointers to `x86_amx` have different semantics
> than regular LLVM pointer types, using regular LLVM pointer types for
> pointers to `x86_amx` may not be appropriate. I’ve not followed the
> previous AMX discussions closely, but it sounds like it may be good to
> reconsider how x86_amx pointers are modeled in LLVM IR.
>
>
>
> Also note that `bitcast` is specified as a no-op (
> https://llvm.org/docs/LangRef.html#id293) (except for pointers with
> different address spaces), but from what you mentioned above this does not
> match the semantics of `x86_amx*`. It sounds like this is the underlying
> problem that should be addressed, because trying to update various
> middle-end optimizations to enforce the special semantics does not seem to
> be a scalable solution.
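>
> Concretely, per that rule the middle-end may treat
>
> %p = bitcast <256 x i32>* %mem to x86_amx*
>
> as a free no-op and fold it into the adjacent load, which is exactly the
> transform shown in your example above.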
>
>
>
> As Nuno mentioned, you could try and use a separate address space for
> `x86_amx` pointers to avoid pointer optimizations.
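>
> A minimal sketch of that idea (address space 7 is an arbitrary number
> chosen for illustration):
>
> %p = addrspacecast <256 x i32>* %mem to x86_amx addrspace(7)*
> %v = load x86_amx, x86_amx addrspace(7)* %p, align 64
>
> Since `bitcast` cannot change the address space, the no-op pointer casts
> that currently lead to `load x86_amx` cannot be introduced; an explicit
> `addrspacecast` would be required, and optimizations will not invent one.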
>
>
>
>
>    2. The physical tile registers have more limitations:
>
>       1. There is no copy instruction between tile registers.
>       2. Spilling / reloading a tile register is expensive, given that
>       its size is 1024 bytes.
>       3. The shapes of tile registers need to be pre-configured before
>       use, and all data in tile registers becomes invalid once they are
>       re-configured. That is to say, we need to cover as many tile
>       registers as possible with a single configuration instruction;
>       otherwise we have to spill and reload the live registers whenever
>       we need to re-configure.
>       4. The number of tile registers is rather small (only 8), and a
>       register configured for one shape cannot be reused for a different
>       shape.
>
> Based on these limitations, we need to reduce the uses / live ranges of
> tile registers, but optimizations may increase them. So even if we could
> handle some combined operations on the AMX type, we still prefer to
> prevent them from the beginning, unless we can completely roll back the
> optimization, which is also not a good solution in my opinion.
>
>
>    3. For more information, please refer to the discussion in [3].
>
> For the other optimization points, please refer to [4][5].
>
>
>
> I think the main controversy raised by Roman is whether a middle-end pass
> should consider special types when doing optimization. I tend to let the
> middle-end do the type check, on account of the peculiarity of the AMX
> type, but I'm not sure whether we have precedent for handling a similar
> issue on other targets. I'm open and glad to do it either way, as long as
> we have an elegant solution.
>
> Any suggestions are welcome.
>
>
>
>
>
>
>
> IIUC the main problem is not whether middle-end passes perform or skip
> optimizations based on certain types. To me it sounds like the actual
> problem is that pointers to `x86_amx` do not behave like regular LLVM IR
> pointers, and you are trying to enforce extra restrictions on bitcasts.
>
>
>
> Cheers,
>
> Florian
>