[PATCH] Make SROA handle addrspacecast

Jingyue Wu jingyue at google.com
Wed Jun 17 14:29:31 PDT 2015


Regarding your question on why these addrspacecasts even exist, the short answer is that emitting addrspacecasts after alloca eases NVPTX's address space inference, which lowers accesses to special address spaces into fast loads/stores.

The long answer requires some background on CUDA and how the NVPTX backend works. I tried to make it brief. Let me know if it makes sense.

First of all, address spaces are not part of CUDA's type system. All pointers in CUDA source code are generic. Only when a variable is declared can the programmer annotate the address space that the variable lives in. For example, `__shared__ int *a = xxx;` means that `a`, the pointer itself, lives in the shared address space; it does not mean that `a` points to an `int` that lives in shared memory.

So, LLVM's middle-end and/or the NVPTX backend bear the burden of optimizing memory accesses to variables in special address spaces. Given an address-taken variable annotated with a special address space, the clang front-end simply casts the resulting address to generic (i.e. `addrspace(0)`) right after the declaration. The optimizer then needs to eliminate unnecessary addrspacecasts and emit fast loads/stores.

Currently, the pass that performs this address space optimization in NVPTX is NVPTXFavorNonGenericAddrSpaces. It works by pushing addrspacecasts (from special to generic) towards loads and stores as far as possible. After that, the codegen can simply lower `load/store <ty>, <ty> addrspace(x)*` to fast loads/stores in machine code. For example, if a function reads from a shared variable, the IR clang emits without optimization looks like

  @a = addrspace(3) global float
  define ... {
    %1 = addrspacecast float addrspace(3)* @a to float* ; cast from special to generic
    %2 = load float, float* %1
    ...
  }

then NVPTXFavorNonGenericAddrSpaces optimizes it to

  @a = addrspace(3) global float
  define ... {
    %2 = load float, float addrspace(3)* @a
    ...
  }

and finally NVPTX's codegen lowers this "load from shared" to a `ld.shared.f32` instruction.

This pass works well for `GlobalVariable`s whose address spaces are explicitly annotated. However, `alloca` always returns `addrspace(0)` pointers, so the clang front-end doesn't (need to) emit any explicit addrspacecast from local to generic in the initial IR. This creates trouble for the address space inference, because NVPTXFavorNonGenericAddrSpaces has no addrspacecasts to push for accesses to the local address space.

To address this issue, we plan to run another pass called NVPTXLowerAlloca (http://reviews.llvm.org/D10483) that emits two addrspacecasts after an alloca, one casting generic to local and the other casting local back to generic. For example,

  define ...() {
    %1 = alloca float
    store float 0.0, float* %1
  }

becomes

  define ...() {
    %1 = alloca float
    %2 = addrspacecast float* %1 to float addrspace(5)*
    %3 = addrspacecast float addrspace(5)* %2 to float*
    store float 0.0, float* %3
  }

Then NVPTXFavorNonGenericAddrSpaces pushes the second addrspacecast towards the store as usual:

  define ...() {
    %1 = alloca float
    %2 = addrspacecast float* %1 to float addrspace(5)*
    store float 0.0, float addrspace(5)* %2
  }

This is why NVPTX emits addrspacecasts after alloca at some stage.
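
To make concrete what SROA then faces after the two NVPTX passes above, here is a sketch. The names @f and @use, and the load that follows the store, are made up for illustration; addrspace(5) is NVPTX's local address space.

  declare void @use(float)

  ; before SROA: the slot is only accessed through an addrspacecast
  define void @f() {
    %1 = alloca float
    %2 = addrspacecast float* %1 to float addrspace(5)*
    store float 0.0, float addrspace(5)* %2
    %3 = load float, float addrspace(5)* %2
    call void @use(float %3)
    ret void
  }

  ; what we would like SROA (with this patch) to do: promote the slot away
  ; and forward the stored value straight to the use
  define void @f() {
    call void @use(float 0.0)
    ret void
  }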

If having SROA handle addrspacecasts of allocas is too alarming, I think the easiest way to fix the issue on our side is to run SROA between NVPTXLowerKernelArgs (another address space inference pass, which must run before SROA) and NVPTXLowerAlloca. That would work in the short term. However, Justin Holewinski and I have some concerns about how much we depend on correctly ordering these address-space inference passes, and we think we should merge all of them into one big pass. If that merging happens, we won't be able to run SROA in the middle. Still, workarounds exist, such as merging most of the address space inference and running only a small part of it before SROA.


================
Comment at: lib/Transforms/Scalar/SROA.cpp:1825
@@ +1824,3 @@
+  if (Ptr->getType() != PointerTy) {
+    Ptr = IRB.CreatePointerBitCastOrAddrSpaceCast(Ptr, PointerTy,
+                                                  NamePrefix + "sroa_cast");
----------------
chandlerc wrote:
> sanjoy wrote:
> > Will this transform a "gep(addrspacecast X, Y)" to "addrspacecast(gep X, Y')"?  If so, I think we need a target specific hook that does a legality check on the addrspacecast we're about to insert -- I don't think "addrspacecast X" being legal implies that an addrspacecast on a GEP derived from X is also legal.
> The whole point of addrspacecast was to avoid the need for target specific checks... I think we just shouldn't transform GEPs around them if that isn't legal everywhere.
Thanks for pointing this out, Sanjoy. I understand your concerns. Out of curiosity, on which existing targets is the transformation from `gep(addrspacecast X, Y)` to `addrspacecast(gep X, Y)` illegal? Just asking so that I can keep a concrete example in mind when I later write target-independent optimizations on addrspacecast.
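
For concreteness, the rewrite in question looks roughly like this in IR (the function names, element type, and index are made up for illustration; the point is that the second form applies the addrspacecast to a pointer derived from X rather than to X itself):

  ; gep(addrspacecast X, Y): cast to generic first, then index
  define float* @before(float addrspace(3)* %x) {
    %p = addrspacecast float addrspace(3)* %x to float*
    %q = getelementptr float, float* %p, i64 1
    ret float* %q
  }

  ; addrspacecast(gep X, Y'): index in the original address space,
  ; then cast the derived pointer to generic
  define float* @after(float addrspace(3)* %x) {
    %q = getelementptr float, float addrspace(3)* %x, i64 1
    %p = addrspacecast float addrspace(3)* %q to float*
    ret float* %p
  }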

http://reviews.llvm.org/D10482
