[LLVMdev] Supporting heterogeneous computing in llvm.

C Bergström cbergstrom at pathscale.com
Sat Jun 6 13:52:57 PDT 2015


>>
>> Anyway, to bring this conversation back to something technical instead
>> of just stupid comments.. I'd agree that flipping targets back and
>> forth (intermixed) in the same Module *is* probably a substantial
>> amount of work. If the optimization passes worked at a PU (program
>> unit) aka function level it wouldn't be.
>>
>
> It's just another level of indirection essentially - and a lot of work. It's
> much easier to do what's being proposed and outline work into another
> module. To do what you've said (and I've looked at) is basically turning
> each function into its own little module - ala what the ORC JIT does with
> per-function compilation.

/* Non-jit example - Old Pro64/MIPSPro from SGI is per PU as well..
I'm not sure what kernelgen is doing.. */

I'm not sure I was clear - I'll try to elaborate.

You take the region of code (or CUDA kernel, etc.) being offloaded and
outline it into a separate PU (function), which goes into a new module
that is appended to the 1st.

This isn't exactly the clang model today, but *if* llvm is a library -
it's easier to handle the 2 modules one after the other.
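
To make the outlining step concrete, here is a rough source-level sketch of
the transformation, not what clang or any LLVM pass actually emits. All names
(region_args, vec_add_region, launch_region) are made up for illustration, and
launch_region is only a host-side stand-in for whatever offload runtime would
really load and run the second module.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names throughout; nothing here is LLVM or clang API. */

/* Values the offloaded region captured from the enclosing function. */
struct region_args {
    const float *a;
    const float *b;
    float *out;
    size_t n;
};

/* The outlined PU. In the scheme described above, this function would be
 * placed in a second module (built for the accelerator target) that is
 * handled right after the host module. */
static void vec_add_region(void *opaque)
{
    struct region_args *args = opaque;
    assert(args != NULL);
    for (size_t i = 0; i < args->n; ++i)
        args->out[i] = args->a[i] + args->b[i];
}

/* Stand-in for an offload runtime: here it simply calls the function on
 * the host. A real runtime would load and launch device code instead. */
typedef void (*region_fn)(void *);

static void launch_region(region_fn fn, void *args)
{
    fn(args);
}

/* Host function after outlining: the loop is gone and only the argument
 * packing plus the launch call remain in the first module. */
void vec_add(const float *a, const float *b, float *out, size_t n)
{
    struct region_args args = { a, b, out, n };
    launch_region(vec_add_region, &args);
}
```

The point of the split is that vec_add and vec_add_region no longer need to
share a target: each one lives in a module with a single target, and the two
modules can be processed one after the other.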

>
>>
>> Why can't you append 1 Module after another and switch?
>
>
> This is, effectively, two modules and it'll behave the same. The reasons are
> data transfer etc for module level attributes, data layout, etc. We've still
> got some lingering issues at the function level let alone at the module
> level with side data taking over. Akira and I are working on them as we can.

cool - good to hear.


>
>>
>>
>> As you point out whole program analysis/optimization will face a
>> similar problem - same question as above.
>> ---------------------
>> Currently - (I don't know about DSP - TI/Qualcomm), but most people in
>> the industry are using custom runtimes to parse the GPU code and
>> load/execute. It would be great if the linker/loader actually had
>> better support for this built-in.
>>
>> I don't know the exact capabilities of gnu/sun linker/loader, but
>> something along the lines of mangling the function name to also include
>> target details
>>
>> so compiler would emit multiple mangled versions of foo() and
>> linker/loader could pick the most optimized.
>>
>> Something like this
>> nvc0_foo
>> avx2_foo
>> avx512_foo
>> (Also I'd agree that the above would be quite hard)
>
>
> There's quite a bit of work in this direction in a lot of different ways.
> You can take a look at the gnu ifunc ELF extensions as a way of doing this
> on a per-subtarget feature level. The obvious extension of this to
> accelerators is something that we've had discussions about (GNU Tools
> Cauldron a couple of years ago) and I believe it's been discussed as part of
> a C++ working group.

The ifunc stuff doesn't behave exactly as I'd like. It's sorta close.
Another example - On Solaris at boot time they have a check for the
system capabilities and mount over libc/m with the most optimized
version the system is capable of. When I first saw this I thought it
was quite clever and cool. (Many years ago) Doing that for
accelerators wouldn't exactly work though - since they can hang and be
(slightly?) less reliable than the CPU. (Not to mention busy)

The upside to this is less work for the loader. The downside is you
have to build multiple versions of libc and friends.
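
For comparison, the per-function selection from earlier in the thread
(nvc0_foo, avx2_foo, avx512_foo) could be sketched roughly as below. This is
a portable illustration only, not GNU ifunc and not any existing loader
feature. The capability bits, the generic_foo fallback, the preference order
and pick_foo are all assumptions made up for the example. Passing the
capabilities in as a mask means a busy or hung accelerator can be left out of
the mask, and the CPU variants are picked instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits; the names mirror the thread's example. */
enum {
    CAP_AVX2   = 1u << 0,
    CAP_AVX512 = 1u << 1,
    CAP_NVC0   = 1u << 2   /* accelerator present and currently usable */
};

typedef int (*foo_fn)(int);

/* Every variant computes the same result; real ones would differ only in
 * the code generated for them. */
static int generic_foo(int x) { return x * 2; }
static int avx2_foo(int x)    { return x * 2; }
static int avx512_foo(int x)  { return x * 2; }
static int nvc0_foo(int x)    { return x * 2; }

struct foo_variant {
    const char *name;     /* mangled-style name, for diagnostics */
    unsigned required;    /* capability bits this variant needs */
    foo_fn fn;
};

/* Ordered from most to least preferred (an assumed order); the last
 * entry requires nothing, so a match always exists. */
static const struct foo_variant foo_variants[] = {
    { "nvc0_foo",    CAP_NVC0,   nvc0_foo },
    { "avx512_foo",  CAP_AVX512, avx512_foo },
    { "avx2_foo",    CAP_AVX2,   avx2_foo },
    { "generic_foo", 0,          generic_foo },
};

/* Return the first variant whose requirements are all present in caps. */
const struct foo_variant *pick_foo(unsigned caps)
{
    size_t count = sizeof foo_variants / sizeof foo_variants[0];
    for (size_t i = 0; i < count; ++i) {
        if ((foo_variants[i].required & caps) == foo_variants[i].required)
            return &foo_variants[i];
    }
    assert(0 && "unreachable: generic_foo has no requirements");
    return NULL;
}
```

With ifunc the equivalent choice is made once by a resolver at load time,
which is part of why it fits CPU subtarget features better than a device that
may become unavailable later.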

>
> At any rate, it's a much bigger discussion than a weekend on the mailing
> list, but there's been some thought about how it'll need to happen on each
> architecture/OS and, as you can tell, it's a matter of ongoing
> experimentation and development. (References: CUDA work, Movidius work,
> etc).

Yeah I agree - I probably won't be sending a patch any time soon, but
I thought I could ask questions around designs that I know have
functionally worked.


