<font size=2 face="sans-serif">Hi Serguei,</font><br><br><font size=2 face="sans-serif">Thanks a lot for the explanation on

the partial linking step.</font><br><br><br><font size=2 face="sans-serif">> "</font><font size=2 color=#004080 face="sans-serif">the

intent is to change offload link step to always operate with fat objects

and libs</font><font size=2 face="sans-serif">"</font><br><font size=2 face="sans-serif">By "offload link step" do

you mean the linking phase that happens on the device toolchain? I assume

so.</font><br><font size=2 face="sans-serif">Will the device link step just work

with "-L. -labc" even if libabc.a contains fat objects?</font><br><br><font size=2 face="sans-serif">From the above I understand that what

you want is to create your very own version of an NVLINK-like tool which

uses the clang-offload-bundler format instead of the CUDA fatbin format.

This to me means that the "unbundling" step could be included

in the device linker step after the resolution of the "-L. -labc"

options. The Clang Driver should only invoke the unbundler in the device

link step (or have the device linker invoke it internally, whichever works

in your case). There is never a need to invoke it for the host part. The

Clang Driver will be calling the "bundler" to put together the

host and device objects.</font><br><font size=2 face="sans-serif">If you do all this then, at least for

your host and device toolchains, you don't need to call "ld -r"

at all. I talk below about where to use "ld -r" though.</font><br><br><br><font size=2 face="sans-serif">> "</font><font size=2 color=#004080 face="sans-serif">Clang-offload-bundler

would need to extract each device object individually while doing unbundling

operation on the partially linked object.</font><font size=2 face="sans-serif">"</font><br><font size=2 face="sans-serif">I think that as soon as you start unbundling

you need to do a full linking step of all device parts to avoid any mess.

This supports the idea that you should do the unbundling inside the device

linking step where you know exactly which parts you need to link in and

where to find them. You then output a single device-only object which you

can then bundle with the host object file.</font><br><br><font size=2 face="sans-serif">I think this will be equivalent to doing

the following:</font><br><br><font size=2 face="sans-serif"><b>A. When creating object files "clang++

-fopenmp -fopenmp-targets=intel-device-triple test.cpp -c -o test.o"

I need to call the clang-offload-bundler</b></font><br><br><font size=2 face="Courier">DEVICE TC: ---[device compilation]--->

dev-test.o </font><br><font size=2 face="Courier">           

                \-------[clang-offload-bundler

--bundle]---> test.o</font><br><font size=2 face="Courier">  HOST TC:  ---[host compilation]--->

host-test.o /</font><br><br><br><font size=2 face="sans-serif"><b>B. when doing: "clang++ -fopenmp

-fopenmp-targets=intel-device-triple test.cpp -L. -labc -o test"</b> there will be no explicit clang-offload-bundler invocations by the clang

driver. The device toolchain will look like this:</font><br><br><font size=2 face="Courier">DEVICE TC: ---[device compilation]--->

dev-test.o ---[MyDeviceLinker -L. -labc]--> device --> ....</font><br><br><font size=2 face="sans-serif">The important part is that the device

link stage will be able to seamlessly link the static lib, which would

involve the following steps:</font><br><font size=2 face="sans-serif">1. find the library using the device

linker's capabilities;</font><br><font size=2 face="sans-serif">2. unpack the library into object files;</font><br><font size=2 face="sans-serif">3. unbundle each object file to obtain

the device object;</font><br><font size=2 face="sans-serif">4. create a static library with the

device objects;</font><br><font size=2 face="sans-serif">5. link dev-test.o against the device

static library you just created.</font><br><font size=2 face="sans-serif">[I have implemented steps 2->4 before

in the clang-ykt compiler so it's definitely feasible once you know the

path to your static library, use the linker to resolve that for you]</font><br><br><br><font size=2 face="sans-serif"><b>C. when only linking is needed: "clang++

-fopenmp -fopenmp-targets=intel-device-triple test1.o test2.o -L. -labc

-o test"</b></font><br><br><font size=2 face="sans-serif">Exactly the same happens as above, only you

need to also unbundle the individual objects and then invoke your usual

linker. The clang-offload-bundler will be called by the device link step

to "unbundle" the objects on the device.</font><br><br><br><br><br><br><font size=2 face="sans-serif">NVLINK cannot be made to work with the

clang-offload-bundler in the way that your device linker can so there will

always have to be something special being done for NVPTX targets no matter

what solution you choose.</font><br><br><font size=2 face="sans-serif">HOWEVER, there is an upside!!</font><br><br><font size=2 face="sans-serif">I suspect that the fat object produced

by the clang-offload-bundler and the object produced by the OpenMP NVPTX

toolchain (with my patch applied) can be successfully partially linked

together by "ld -r".</font><br><br><font size=2 face="sans-serif">Assuming we have something like this:</font><br><br><font size=2 face="sans-serif"><b>clang++ -fopenmp -fopenmp-targets=intel-device-triple,nvptx64-nvidia-cuda

test.cpp -c -o test.o</b></font><br><br><font size=2 face="sans-serif">then we can do this:</font><br><font size=2 face="sans-serif"> </font><br><font size=2 face="Courier">NVPTX TC:  ---[nvptx compilation]--------------------->

nvptx-test.o ------------------------\ </font><br><font size=2 face="Courier">INTEL TC: ---[intel compilation]--->

intel-test.o                  

    \</font><br><font size=2 face="Courier">           

               \ ----[clang-offload-bundler]---->

tmp.o ---[ld -r]---> test.o</font><br><font size=2 face="Courier">HOST TC:  ---[host compilation]--->

host-test.o  /</font><br><br><font size=2 face="sans-serif">Essentially we first resolve the clang-offload-bundler

bundling to obtain tmp.o for intel and other toolchains. Then we can partially

link tmp.o against the nvptx-test.o object file.</font><br><br><font size=2 face="sans-serif">Sanity checks:</font><br><font size=2 face="sans-serif">The nvptx-test.o file will be treated

by "ld -r" just like any other host file that doesn't contain

a "bundled" device section (it was not bundled with the clang-offload-bundler

but with the CUDA specific format). So if partially linking together a

host only file and a bundled file works then this should work too.</font><br><font size=2 face="sans-serif">test.o can be processed by the NVPTX

toolchain? Yes, NVLINK will know where the device side of the object is,

no unbundling required to get to the device part.</font><br><font size=2 face="sans-serif">test.o can be processed by the INTEL

toolchain? Yes, the object will be passed to the Device Linker where it

will be unbundled and device side extracted.</font><br><font size=2 face="sans-serif">static linking works for both toolchains?

Yes, previous patch enables it for NVPTX, this patch would enable it for

INTEL.</font><br><font size=2 face="sans-serif">static linking works for both toolchains

when used together? Yes, because the CUDA specific fatbinary format will

not be obstructed by the clang-offload-bundling format.</font><br><br><font size=2 face="sans-serif">Thanks,</font><br><br><font size=2 face="sans-serif">--Doru</font><br><br><br><br><font size=2 face="sans-serif"><br></font><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">"Dmitriev, Serguei

N" <serguei.n.dmitriev@intel.com></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif">Gheorghe-Teod Bercea

<Gheorghe-Teod.Bercea@ibm.com></font><br><font size=1 color=#5f5f5f face="sans-serif">Cc:      

 </font><font size=1 face="sans-serif">"'cfe-dev@lists.llvm.org'"

<cfe-dev@lists.llvm.org>, Jonas Hahnfeld <hahnjo@hahnjo.de></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">08/16/2018 07:26 PM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">RE: [cfe-dev]

[OpenMP] offload support for static libraries</font><br><hr noshade><br><br><br><font size=2 color=#004080 face="sans-serif">Hi Doru,</font><br><br><font size=2 color=#004080 face="sans-serif">Thank you for the detailed

description of your changes.</font><br><font size=2 color=#004080 face="sans-serif"> </font><br><font size=2 face="Arial">> Regarding your proposal, from your slides

I understand that you perform a partial linking step as the first action

for all object files and/or static libraries given as input. So this clang

invocation:<br>> clang++ -L. -labc test.cpp -o test<br>> would result in the same compilation steps as the current Clang version

performs because the initial stage of partial linking would have no work

to do (since there are no object files present to be partially linked).</font><font size=3 face="Times New Roman"><br></font><br><font size=2 color=#004080 face="sans-serif">No, the intent is to change

offload link step to always operate with fat objects and libs, so the compilation

part of the action graph for that command should produce temporary fat

object which is then passed to the partial linking. I have not described

that in slides, but I agree that it is probably not obvious and should

have been mentioned. </font><br><font size=2 color=#004080 face="sans-serif"> </font><br><font size=2 face="Arial">> Another question I have is regarding

the "ld -r" box in your slides.<br>> How does ld -r work with "bundled" objects? Your diagram

seems to imply that ld -r does the concatenation of all device images out

of the box. Is this accurate?</font><font size=3 face="Times New Roman"><br></font><br><font size=2 color=#004080 face="sans-serif">Actually the technique

that is currently used by the clang-offload-bundler tool for creating bundled

(or fat) objects is very similar to what you are going to do for bundling

NVPTX and host objects. The difference is in the initial “wrapper” that

is created for the device code – you are using C++ structure, while clang-offload-bundler

uses LLVM bitcode file. The bundler tool creates a temporary LLVM IR which

contains a global initialized array holding the device object, and this

array is allocated in a section with predefined name (the name includes

offload target triple). This “wrapper” bitcode file is then compiled

for the host and then partially linked against the host object.</font><br><font size=2 color=#004080 face="sans-serif"> </font><br><font size=2 color=#004080 face="sans-serif">So technically fat/bundled

object is just a host ELF object which has one or more additional ELF sections

containing device object as data (one extra section per each offloading

target). Linker concatenates sections with the same name while linking

multiple objects files, so the result of partial linking will have the

same named ELF sections holding device code concatenated from all input

(fat) objects.</font><br><font size=2 color=#004080 face="sans-serif"> </font><br><font size=2 color=#004080 face="sans-serif">Clang-offload-bundler

would need to extract each device object individually while doing unbundling

operation on the partially linked object. A possible way to enable this

would be creating one more section holding the device object size in addition

to existing section with device object in fat/bundled object. That would

allow clang bundler to get sizes of device objects that were concatenated

by partial linking.</font><br><font size=2 color=#004080 face="sans-serif"> </font><br><font size=2 color=#004080 face="sans-serif">Thanks,</font><br><font size=2 color=#004080 face="sans-serif">Serguei</font><br><font size=2 color=#004080 face="sans-serif"> </font><br><a name=_____replyseparator></a><font size=2 face="sans-serif"><b>From:</b>Gheorghe-Teod Bercea [</font><a href="mailto:Gheorghe-Teod.Bercea@ibm.com"><font size=2 face="sans-serif">mailto:Gheorghe-Teod.Bercea@ibm.com</font></a><font size=2 face="sans-serif">]

<b><br>Sent:</b> Thursday, August 16, 2018 6:15 AM<b><br>To:</b> Dmitriev, Serguei N <serguei.n.dmitriev@intel.com><b><br>Cc:</b> 'cfe-dev@lists.llvm.org' <cfe-dev@lists.llvm.org>; Jonas

Hahnfeld <hahnjo@hahnjo.de><b><br>Subject:</b> RE: [cfe-dev] [OpenMP] offload support for static libraries</font><br><font size=3 face="Times New Roman"> </font><br><font size=2 face="Arial">Hi Serguei,</font><font size=3 face="Times New Roman"><br></font><font size=2 face="Arial"><br>Thanks a lot for the proposal.</font><font size=3 face="Times New Roman"><br></font><font size=2 face="Arial"><br>My proposal reworks a little bit the way the OpenMP-NVPTX toolchain creates

device object files: the device specific part of the object is "wrapped"

in an NVLINK-friendly C++ structure that is then compiled for the host.

The result is a host object file with a device part which NVLINK can detect

(D.o). The D.o object file is then partially linked against the host object

file H.o and thus we obtain HD.o. This is required because compilation

is required to produce a single output object file (when doing "-c

-o" for example). HD.o can now be passed to NVLINK directly or put

in a static library and then passed to NVLINK. Either way, NVLINK will

be able to detect the device part (due to the special wrapping that we

did previously) without the need to "unbundle" the object file

(prior to passing it to NVLINK).</font><font size=3 face="Times New Roman"><br></font><font size=2 face="Arial"><br>The reason why the clang-offload-bundler is not involved in this is because

we are using the standard object format for the object file that the OpenMP-NVPTX

toolchain outputs so there's no need for a custom format in this case.

The partial linking step is required to put together the host and device

object files and to ensure that only one object file is produced even if

we actually invoked two toolchains (one for host and one for the device).</font><font size=3 face="Times New Roman"><br><br></font><font size=2 face="Arial"><br>Regarding your proposal, from your slides I understand that you perform

a partial linking step as the first action for all object files and/or

static libraries given as input. So this clang invocation:<br>clang++ -L. -labc test.cpp -o test<br>would result in the same compilation steps as the current Clang version

performs because the initial stage of partial linking would have no work

to do (since there are no object files present to be partially linked).</font><font size=3 face="Times New Roman"><br><br></font><font size=2 face="Arial"><br>Another question I have is regarding the "ld -r" box in your

slides.<br>How does ld -r work with "bundled" objects? Your diagram seems

to imply that ld -r does the concatenation of all device images out of

the box. Is this accurate?</font><font size=3 face="Times New Roman"><br><br></font><font size=2 face="Arial"><br>Thanks a lot,</font><font size=3 face="Times New Roman"><br></font><font size=2 face="Arial"><br>--Doru</font><font size=3 face="Times New Roman"><br></font><font size=2 face="Arial"><br></font><font size=3 face="Times New Roman"><br><br><br><br></font><font size=1 color=#5f5f5f face="Arial"><br>From:        </font><font size=1 face="Arial">"Dmitriev,

Serguei N" <</font><a href="mailto:serguei.n.dmitriev@intel.com"><font size=1 color=blue face="Arial"><u>serguei.n.dmitriev@intel.com</u></font></a><font size=1 face="Arial">></font><font size=1 color=#5f5f5f face="Arial"><br>To:        </font><font size=1 face="Arial">Jonas Hahnfeld

<</font><a href="mailto:hahnjo@hahnjo.de"><font size=1 color=blue face="Arial"><u>hahnjo@hahnjo.de</u></font></a><font size=1 face="Arial">></font><font size=1 color=#5f5f5f face="Arial"><br>Cc:        </font><font size=1 face="Arial">"'cfe-dev@lists.llvm.org'"

<</font><a href="mailto:cfe-dev@lists.llvm.org"><font size=1 color=blue face="Arial"><u>cfe-dev@lists.llvm.org</u></font></a><font size=1 face="Arial">>,

Doru Bercea <</font><a href="mailto:gheorghe-teod.bercea@ibm.com"><font size=1 color=blue face="Arial"><u>gheorghe-teod.bercea@ibm.com</u></font></a><font size=1 face="Arial">></font><font size=1 color=#5f5f5f face="Arial"><br>Date:        </font><font size=1 face="Arial">08/15/2018

04:27 PM</font><font size=1 color=#5f5f5f face="Arial"><br>Subject:        </font><font size=1 face="Arial">RE:

[cfe-dev] [OpenMP] offload support for static libraries</font><div align=center><hr noshade></div><br><font size=3 face="Times New Roman"><br><br></font><font size=2 face="Courier New"><br>Hi Jonas,<br><br>I guess this patch implements the proposal which Doru presented on the

"OpenMP / HPC in Clang / LLVM Multi-company" meeting. As I remember

he suggested to eliminate use of clang-offload-bundler tool when offload

target is NVPTX by replacing bundling operation with partial linking of

host and device objects, and then relying on the NVPTX linker to perform

the unbundling operation at link phase. Based on Doru's explanations NVPTX

linker "knows" how to extract device parts from such objects,

so the explicit unbundling operation is not required. Doru, please correct

me if my understanding is not fully accurate. Doru's proposal definitely

achieves the same goal for NVPTX offloading target (i.e. enables offload

in static libraries), but it is NVPTX specific and cannot be extended to

other offloading targets (at least that is how it looked when Doru

described it).<br><br>I propose a slightly different solution which I think should work for any

generic OpenMP offload target (it was also discussed on the OpenMP multi-company

meeting). In the general case we have to use clang-offload-bundler because

we cannot assume that device object(s) can be bundled with the host object

by performing partial linking of host and device objects. So bundling and

unbundling operation will still be done by the clang-offload-bundler tool.

The main part of my suggestion is adding partial linking of fat objects

(created by offload bundler tool) and static libraries (which are composed

of fat objects) and only after that do the unbundling operation on the

partially linked object (followed by the appropriate link actions for all

offloading devices and then for the host). This would guarantee that device

parts of fat objects from static libraries will participate in the device

link actions, and thus would enable offloading for static libraries.<br><br>Thanks,<br>Serguei<br><br>-----Original Message-----<br>From: Jonas Hahnfeld [</font><a href="mailto:hahnjo@hahnjo.de"><font size=2 color=blue face="Courier New"><u>mailto:hahnjo@hahnjo.de</u></font></a><font size=2 face="Courier New">]

<br>Sent: Tuesday, August 14, 2018 1:52 PM<br>To: Dmitriev, Serguei N <</font><a href="mailto:serguei.n.dmitriev@intel.com"><font size=2 color=blue face="Courier New"><u>serguei.n.dmitriev@intel.com</u></font></a><font size=2 face="Courier New">><br>Cc: 'cfe-dev@lists.llvm.org' <</font><a href="mailto:cfe-dev@lists.llvm.org"><font size=2 color=blue face="Courier New"><u>cfe-dev@lists.llvm.org</u></font></a><font size=2 face="Courier New">>;

Doru Bercea <</font><a href="mailto:gheorghe-teod.bercea@ibm.com"><font size=2 color=blue face="Courier New"><u>gheorghe-teod.bercea@ibm.com</u></font></a><font size=2 face="Courier New">><br>Subject: Re: [cfe-dev] [OpenMP] offload support for static libraries<br><br>This has already been proposed for NVPTX in </font><a href="https://reviews.llvm.org/D47394"><font size=2 color=blue face="Courier New"><u>https://reviews.llvm.org/D47394</u></font></a><font size=2 face="Courier New">,

adding Doru.<br><br>Cheers,<br>Jonas<br><br>On 2018-08-14 18:43, Dmitriev, Serguei N via cfe-dev wrote:<br>> PROBLEM OVERVIEW<br>> <br>> OpenMP offload functionality is currently not supported in static

<br>> libraries. Because of that an attempt to use offloading in static

<br>> libraries ends up with a fallback execution of target regions on the

<br>> host. This limitation clearly has significant impact on OpenMP offload

<br>> usability.<br>> <br>> An output object file that is created by the compiler for offload

<br>> compilation is a fat object. Such object files besides the code for

<br>> the host architecture also contain code for the offloading targets

<br>> which is stored as data in ELF sections with predefined names. Thus,

a <br>> static library that is created from object files produced by offload

<br>> compilation would be an archive of fat objects.<br>> <br>> Clang driver currently never passes fat objects directly to any <br>> toolchain. Instead it performs an unbundling operation for each fat

<br>> object which extracts host and device parts from the object. These

<br>> parts are then independently processed by the corresponding target

<br>> toolchains. However, current implementation does not assume that <br>> static archives may also be composed from fat objects. No unbundling

<br>> is done for static archives (they are passed to linker as is) and

thus <br>> device parts of objects from such archives get ignored.<br>> <br>> SUGGESTED SOLUTION<br>> <br>> It seems feasible to resolve this problem by changing the offload

link <br>> process - adding an extra step to the link flow which will do a <br>> partial linking (ld -r) of fat objects and static libraries as shown

<br>> on this diagram<br>> <br>> [Fat objects] \                

                / [Target1 link]

\<br>> <br>>                [Partial linking]

- [Unbundling] - [TargetN link] - <br>> [Host link]<br>> <br>> [Static libs] /                

                \--- Host part

--/<br>> <br>> (You can also look at the .pdf file on this link <br>> </font><a href="https://drive.google.com/file/d/1ZTNoB-Ghin1BTaiZ312FMSRS6rISDtlr/view"><font size=2 color=blue face="Courier New"><u>https://drive.google.com/file/d/1ZTNoB-Ghin1BTaiZ312FMSRS6rISDtlr/view</u></font></a><font size=2 face="Courier New"><br>> ?usp=sharing [1] for illustrations for the suggested change)<br>> <br>> Linker will pull in all necessary dependencies from static libraries

<br>> while performing partial linking, so the result of partial linking

<br>> would be a fat object with concatenated device parts from input fat

<br>> objects and required dependencies from static libraries. These <br>> concatenated device objects will be stored in the corresponding ELF

<br>> sections of the partially linked object.<br>> <br>> Unbundling operation on the partially linked object will create one

or <br>> more device objects for each offloading target, and these objects

will <br>> be linked by corresponding target toolchains the same way as it is

<br>> done now. Offload bundler tool would require enhancements to support

<br>> unbundling of multiple concatenated device objects for each offloading

<br>> target.<br>> <br>> Host link action can be changed to use host part of the partially

<br>> linked object while linking the final image.<br>> <br>> Do you see any potential problems in the proposed change?<br>> <br>> Links:<br>> ------<br>> [1]<br>> </font><a href="https://drive.google.com/file/d/1ZTNoB-Ghin1BTaiZ312FMSRS6rISDtlr/view"><font size=2 color=blue face="Courier New"><u>https://drive.google.com/file/d/1ZTNoB-Ghin1BTaiZ312FMSRS6rISDtlr/view</u></font></a><font size=2 face="Courier New"><br>> ?usp=sharing _______________________________________________<br>> cfe-dev mailing list<br>> </font><a href="mailto:cfe-dev@lists.llvm.org"><font size=2 color=blue face="Courier New"><u>cfe-dev@lists.llvm.org</u></font></a><font size=2 face="Courier New"><br>> </font><a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev"><font size=2 color=blue face="Courier New"><u>http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</u></font></a><font size=2 face="Courier New"><br></font><font size=3 face="Times New Roman"><br><br></font><br><br><BR>