[LLVMdev] Reducing Generic Address Space Usage

Wed Mar 26 08:48:19 PDT 2014

On Tue, Mar 25, 2014 at 02:31:05PM -0700, Jingyue Wu wrote:
> This is a follow-up discussion on
> http://lists.cs.uiuc.edu/pipermail/cfe-commits/Week-of-Mon-20140324/101899.html.
> The front-end change was already pushed in r204677, so we want to continue
> with the IR optimization.
> 
> In general, we want to write an IR pass to convert generic address space
> usage to non-generic address space usage, because accessing the generic
> address space in CUDA and OpenCL is significantly slower than accessing
> non-generic ones (such as shared and constant).
> 
> Here is an example Justin gave:
> 
>   %ptr = ...
>   %val = load i32* %ptr
> 
> In this case, %ptr is a generic address space pointer (assuming an address
> space mapping where 0 is generic).  But if an analysis can prove that the
> pointer %ptr was originally addrspacecast'd from a specific address space
> (or some other mechanism through which the pointer's specific address space
> can be determined), it may be beneficial to explicitly convert the IR to
> something like:
> 
>   %ptr = ...
>   %ptr.0 = addrspacecast i32* %ptr to i32 addrspace(3)*
>   %val = load i32 addrspace(3)* %ptr.0
> 
> Such a translation may generate better code for some targets.
>

I think a slight variation of this optimization may be useful for the
R600 backend.  One thing I have been working on is migrating allocas
to different address spaces, which in some cases may improve
performance.  Here is an example:

%ptr = alloca [5 x i32]
...

Would become:

@local_mem = internal addrspace(3) unnamed_addr global [5 x i32]

%ptr = addrspacecast [5 x i32] addrspace(3)* @local_mem to [5 x i32]*
...

In this case I would like all users of %ptr to read and write
address space 3 rather than address space 0, and it sounds like your
proposed optimization pass could do this.
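
To make the intent concrete, here is a rough sketch of the rewrite I have
in mind. It is a toy model in plain C++, not LLVM's API: Op, Inst,
rewriteUsers and buildAllocaExample are made-up names, and the "IR" is
just a vector in which each instruction refers to its pointer operand by
index. It assumes address space 0 is generic and 3 is local, as in the
example above.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-ins for IR instructions; none of these types exist in LLVM.
enum class Op { Global, AddrSpaceCast, Load, Store };

struct Inst {
  Op op;
  std::string name;
  unsigned addrSpace; // address space of the pointer produced or accessed
  int ptrOperand;     // index of the pointer operand, or -1 if there is none
};

// For every load/store whose pointer operand is an addrspacecast, access the
// cast's source pointer directly. Returns the number of rewritten users.
unsigned rewriteUsers(std::vector<Inst> &insts) {
  unsigned changed = 0;
  for (Inst &user : insts) {
    if (user.op != Op::Load && user.op != Op::Store)
      continue;
    if (user.ptrOperand < 0)
      continue;
    const Inst &cast = insts[user.ptrOperand];
    if (cast.op != Op::AddrSpaceCast || cast.ptrOperand < 0)
      continue;
    const int srcIdx = cast.ptrOperand;
    user.ptrOperand = srcIdx;                  // bypass the cast
    user.addrSpace = insts[srcIdx].addrSpace;  // now a non-generic access
    ++changed;
  }
  return changed;
}

// Models the example above: @local_mem lives in address space 3, %ptr is its
// cast to address space 0, and one store and one load go through %ptr.
std::vector<Inst> buildAllocaExample() {
  return {
      {Op::Global, "@local_mem", 3, -1},
      {Op::AddrSpaceCast, "%ptr", 0, 0},
      {Op::Store, "store", 0, 1},
      {Op::Load, "%val", 0, 1},
  };
}
```

Calling rewriteUsers() on the result of buildAllocaExample() leaves the
store and the load pointing at @local_mem in address space 3; the cast
itself is left behind as dead code for a later cleanup to remove.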

> There are two major design decisions we need to make:
> 
> 1. Where does this pass live? Target-independent or target-dependent?
> 
> Both the NVPTX and R600 backends want this optimization, which seems like a
> good justification for making it target-independent.
> 

I agree here.

> However, we have three concerns about this:
> a) I doubt this optimization is valid for all targets, because the LLVM
> language reference (
> http://llvm.org/docs/LangRef.html#addrspacecast-to-instruction) says
> addrspacecast "can be a no-op cast or a complex value modification,
> depending on the target and the address space pair."

Does it matter that it isn't valid for all targets as long as it is
valid for some?  We could add it, but not run it by default.

> b) NVPTX and R600 have different address numbering for the generic address
> space, which makes things more complicated.

Could we add a TargetLowering callback that the pass can use to determine
whether or not it is profitable to replace one address space with
another?
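
Something along these lines, as a rough self-contained sketch. To be
clear, this is not an existing TargetLowering method: AddrSpaceHooks,
getGenericAddressSpace, isProfitableToReplaceAddressSpace and
shouldFoldCast are hypothetical names, and the two example targets and
their address space numbers are invented purely for illustration.

```cpp
#include <cassert>

// Hypothetical per-target hooks; NOT part of LLVM's TargetLowering.
struct AddrSpaceHooks {
  virtual ~AddrSpaceHooks() = default;
  // Which address space number the target treats as generic.
  virtual unsigned getGenericAddressSpace() const = 0;
  // Whether accesses through `from` are worth rewriting to use `to`.
  virtual bool isProfitableToReplaceAddressSpace(unsigned from,
                                                 unsigned to) const = 0;
};

// Invented target: generic is 0, and any specific space is faster.
struct ExampleTargetA : AddrSpaceHooks {
  unsigned getGenericAddressSpace() const override { return 0; }
  bool isProfitableToReplaceAddressSpace(unsigned from,
                                         unsigned to) const override {
    return from == 0 && to != 0;
  }
};

// Invented target: generic is 4, and only space 3 is worth the rewrite.
struct ExampleTargetB : AddrSpaceHooks {
  unsigned getGenericAddressSpace() const override { return 4; }
  bool isProfitableToReplaceAddressSpace(unsigned from,
                                         unsigned to) const override {
    return from == 4 && to == 3;
  }
};

// What a target-independent pass could ask before looking through a cast
// from srcAS to dstAS: is this a specific-to-generic cast, and does the
// target want accesses moved back to the specific address space?
bool shouldFoldCast(const AddrSpaceHooks &target, unsigned srcAS,
                    unsigned dstAS) {
  if (dstAS != target.getGenericAddressSpace())
    return false;
  if (srcAS == dstAS)
    return false;
  return target.isProfitableToReplaceAddressSpace(dstAS, srcAS);
}
```

With hooks like these the pass itself never hard-codes an address space
number, and a target that does not implement them could simply opt out.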

-Tom

> c) We don't have a good understanding of the R600 backend.
> 
> Therefore, I would vote for making this optimization NVPTX-specific for
> now. If other targets need this, we can later think about how to reuse the
> code.
> 
> 2. How effective do we want this optimization to be?
> 
> In the short term, I want it to be able to eliminate unnecessary
> non-generic-to-generic addrspacecasts the front-end generates for the NVPTX
> target. For example,
> 
> %p1 = addrspacecast i32 addrspace(3)* %p0 to i32*
> %v = load i32* %p1
> 
> =>
> 
> %v = load i32 addrspace(3)* %p0
> 
> We want similar optimization for store+addrspacecast and gep+addrspacecast
> as well.
> 
> In the long term, we could certainly improve this optimization to handle more
> instructions and more patterns.
> 
> Jingyue
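
For the gep+addrspacecast case, one way to picture it is to walk from the
accessed pointer back through any GEPs and check whether the chain starts
at an addrspacecast. The sketch below is a toy model in plain C++, not
LLVM code: Kind, Value and inferAddressSpace are made-up names, and values
refer to their pointer operand by index.

```cpp
#include <cassert>
#include <vector>

// Toy stand-ins for pointer-producing IR values; not LLVM types.
enum class Kind { Pointer, AddrSpaceCast, GEP };

struct Value {
  Kind kind;
  unsigned addrSpace; // address space of the pointer this value produces
  int base;           // index of the pointer operand, or -1 for a root
};

// Walks from values[idx] back through GEPs. If the chain starts at an
// addrspacecast, returns the address space of the cast's source pointer;
// otherwise returns the address space values[idx] already has.
unsigned inferAddressSpace(const std::vector<Value> &values, int idx) {
  int cur = idx;
  while (values[cur].kind == Kind::GEP && values[cur].base >= 0)
    cur = values[cur].base;
  if (values[cur].kind == Kind::AddrSpaceCast && values[cur].base >= 0)
    return values[values[cur].base].addrSpace;
  return values[idx].addrSpace;
}
```

A load or store through a GEP for which this returns a non-generic
address space could then be rewritten to use that address space, in the
same way as the plain load+addrspacecast case.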