[Openmp-dev] [RFC] Late (OpenMP) GPU code "SPMD-zation"

Wed Mar 13 14:55:37 PDT 2019

Maybe it helps that I (or rather Jose Diaz) ran the OpenMP target offloading V&V test suite [1] with the new code generation enabled.

All but one test passed; I'm looking into the remaining problem right now.

[1] https://crpl.cis.udel.edu/ompvvsollve/

________________________________
From: Alexey Bataev <a.bataev at hotmail.com>
Sent: Wednesday, March 13, 2019 4:33:03 PM
To: Doerfert, Johannes
Cc: Alexey Bataev; cfe-dev at lists.llvm.org; openmp-dev at lists.llvm.org; llvm-dev; Finkel, Hal J.
Subject: Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"

1. You don't need to implement everything in a single patch. Development is a step-by-step process in which you commit things in small pieces. The code need not be fully functional; you may start with some basic features. Currently it is very hard to review.
2. I rather doubt that it can be reused without changes for AMD etc., especially without being fully tested. The only tested target is NVPTX and at first we need to support it. Later, we could extend it to AMD and some other targets.
3. No, it is not incidental. It is thoroughly tested, at least.
4. Hm, if that were so, I would have just ignored it. Yes, I'm a bit sceptical, but this is normal. The fact is that these patches violate the coding standards, which suggest splitting patches into small pieces and committing them one by one.

Best regards,
Alexey Bataev

> On March 13, 2019, at 17:18, Doerfert, Johannes <jdoerfert at anl.gov> wrote:
>
>> On 03/13, Alexey Bataev wrote:
>> On 13.03.2019 15:35, Doerfert, Johannes wrote:
>>>
>>> Hi Alexey,
>>>
>>>
>>> thank you for your quick feedback.
>>>
>>>
>>>> There are tooooooo(!) many changes, I don't know who's going to review sooooo
>>> big a patch.
>>>
>>>
>>> I can for sure split it into the three components/repositories that are
>>> touched: clang, llvm, and openmp.
>>>
>>> I feared it would then be harder to navigate the code in order to see
>>> the connection points.
>>>
>>> I am a bit amazed by your hyperbolism though, given that the complexity
>>> is not that high due to the absence of modified or removed lines.
>>> Anyway, you seem to have very strong feelings about this, so I am open
>>> to suggestions on how to split it up.
>>>
>>>
>>
>> 1. You definitely need to split it into separate patches for different
>> components.
>
> Done:
>  OpenMP: https://reviews.llvm.org/D59319
>   Clang: https://reviews.llvm.org/D59328
>    LLVM: https://reviews.llvm.org/D59331
>
>> 2. Even inside of those components this patch must be split into several
>> small patches; it is very hard to review such big patches.
>
> Please take a look at the three patches above.
>
> The first contains the interface definition and implementation for NVPTX
> (in cuda). I don't know how to further split that except to separate it
> into the definition and the implementation, though that does not make
> sense to me.
>
> The second contains the code generation. It is very much like the NVPTX
> code generation except that it does not contain logic.
>
> The third is the LLVM pass which could be split into two, SPMD-mode and
> state machine creation. I'll wait for feedback on the other patches
> before I go ahead.
>
>
>>>> Also, I don't like the idea of adding one more class for NVPTX
>>> codegen. All your changes should be on top of the existing solution.
>>>
>>>
>>> Could you please explain to me why? This will only make everything
>>> more complicated and entangled.
>>> Also, the new class is supposed to be "target agnostic" so a new
>>> offloading target, e.g., AMD GPUs, could easily reuse
>>> the new code while the old code is sprinkled with NVPTX specific
>>> details, e.g., function calls, constants, etc.
>>>
>> 1. As far as I know, even now the NVPTX codegen can be reused for AMD
>> GPUs with some small changes.
>
> The target region code generation is supposed to be reusable for
> AMD/XYZ/... without changes.
>
>
>> 2. Your patch is about codegen for NVPTX, so you must change the
>> existing codegen rather than introduce a new one for the same target.
>
> I strongly disagree. The patch is not "for NVPTX" but for "OpenMP target
> offloading", maybe with a focus on "GPU kernels". The fact that the only
> target offloading device we currently support is based on CUDA and NVPTX
> is incidental.
>
>
>> There is no point in maintaining two different codegens for one target.
>
> Given your comments on my initial RFC and prototype I strongly suspected
> you do not want this approach to replace the current NVPTX code
> generation. Once that changes we can get rid of one of them.
>
>
>>>
>>> Thanks again,
>>>   Johannes
>>> ------------------------------------------------------------------------
>>> *From:* Alexey Bataev <a.bataev at outlook.com>
>>> *Sent:* Wednesday, March 13, 2019 2:15:39 PM
>>> *To:* Doerfert, Johannes; cfe-dev at lists.llvm.org
>>> *Cc:* openmp-dev at lists.llvm.org; LLVM-Dev; Finkel, Hal J.; Alexey
>>> Bataev; Arpith Chacko Jacob
>>> *Subject:* Re: [RFC] Late (OpenMP) GPU code "SPMD-zation"
>>>
>>>
>>> There are tooooooo(!) many changes, I don't know who's going to review
>>> sooooo big a patch. You definitely need to split it into several smaller
>>> patches. Also, I don't like the idea of adding one more class for
>>> NVPTX codegen. All your changes should be on top of the existing
>>> solution.
>>>
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>> On 13.03.2019 15:08, Doerfert, Johannes wrote:
>>>> Please consider reviewing the code for the proposed approach here:
>>>>  https://reviews.llvm.org/D57460
>>>>
>>>> Initial tests, e.g., on the nw (needleman-wunsch) benchmark in the
>>>> rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode was
>>>> enabled automatically. The code in nw is conceptually equivalent to the
>>>> first example in the "to_SPMD_mode.ll" test case that can be found here:
>>>>  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid
>>>>
>>>> The implementation is missing key features but one should be able to see
>>>> the overall design by now. Once accepted, the missing features and more
>>>> optimizations will be added.
>>>>
>>>>
>>>>> On 01/22, Johannes Doerfert wrote:
>>>>> Where we are
>>>>> ------------
>>>>>
>>>>> Currently, when we generate OpenMP target offloading code for GPUs, we
>>>>> use sufficient syntactic criteria to decide between two execution modes:
>>>>>  1)      SPMD -- All target threads (in an OpenMP team) run all the code.
>>>>>  2) "Guarded" -- The master thread (of an OpenMP team) runs the user
>>>>>                  code. If an OpenMP distribute region is encountered, thus
>>>>>                  if all threads (in the OpenMP team) are supposed to
>>>>>                  execute the region, the master wakes up the idling
>>>>>                  worker threads and points them to the correct piece of
>>>>>                  code for distributed execution.
>>>>>
>>>>> For a variety of reasons we (generally) prefer the first execution mode.
>>>>> However, depending on the code, that might not be valid, or we might
>>>>> just not know if it is in the Clang code generation phase.
>>>>>
>>>>> The implementation of the "guarded" execution mode follows roughly the
>>>>> state machine description in [1], though the implementation is different
>>>>> (more general) nowadays.
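
To make the two modes described above concrete, here is a small C sketch of the kind of source that would plausibly fall on either side. The function names are made up for illustration, and which mode a given compiler version actually picks for each region is an assumption, not something stated in this thread. Without OpenMP support the pragmas are ignored and the code simply runs serially on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Combined construct: all threads of every team enter the loop together,
   so there is no sequential prologue that would need a single master
   thread. This is the shape that fits SPMD mode. */
void scale_all(double *a, int n, double s) {
  assert(a != NULL && n > 0);
  #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i)
    a[i] *= s;
}

/* Sequential code with a side effect precedes the parallel loop. The
   update of a[0] must happen exactly once, so only one thread may run
   it while the others wait. This is the shape that calls for the
   "guarded" mode. */
double bump_then_scale(double *a, int n, double s) {
  assert(a != NULL && n > 0);
  double first = 0.0;
  #pragma omp target map(tofrom: a[0:n], first)
  {
    a[0] += 1.0;   /* sequential part: one thread only */
    first = a[0];
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      a[i] *= s;   /* parallel part: all threads of the team */
  }
  return first;
}
```

The second function is the interesting case for the rest of the proposal: if the sequential part could be guarded cheaply, the whole region could run in SPMD mode as well.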
>>>>>
>>>>>
>>>>> What we want
>>>>> ------------
>>>>>
>>>>> Increase the amount of code executed in SPMD mode and the use of
>>>>> lightweight "guarding" schemes where appropriate.
>>>>>
>>>>>
>>>>> How we (could) get there
>>>>> ------------------------
>>>>>
>>>>> We propose the following two modifications in order:
>>>>>
>>>>>  1) Move the state machine logic into the OpenMP runtime library. That
>>>>>     means in SPMD mode all device threads will start the execution of
>>>>>     the user code, thus emerge from the runtime, while in guarded mode
>>>>>     only the master will escape the runtime and the other threads will
>>>>>     idle in their state machine code that is now just "hidden".
>>>>>
>>>>>     Why:
>>>>>     - The state machine code cannot be (reasonably) optimized anyway,
>>>>>       moving it into the library shouldn't hurt runtime but might even
>>>>>       improve compile time a little bit.
>>>>>     - The change should also simplify the Clang code generation as we
>>>>>       would generate structurally the same code for both execution modes
>>>>>       but only the runtime library calls, or their arguments, would
>>>>>       differ between them.
>>>>>     - The reason we should not "just start in SPMD mode" and "repair"
>>>>>       it later is simple: this way we always have semantically correct
>>>>>       and executable code.
>>>>>     - Finally, and most importantly, there is now only little
>>>>>       difference (see above) between the two modes in the code
>>>>>       generated by clang. If we later analyze the code trying to decide
>>>>>       if we can use SPMD mode instead of guarded mode the analysis and
>>>>>       transformation becomes much simpler.
>>>>>
>>>>>  2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
>>>>>     e.g., through the runtime library calls used, and that tries to
>>>>>     convert it into the SPMD mode potentially by introducing lightweight
>>>>>     guards in the process.
>>>>>
>>>>>     Why:
>>>>>     - After the inliner, and the canonicalizations, we have a clearer
>>>>>       picture of the code that is actually executed in the target
>>>>>       region and all the side effects it contains. Thus, we can make an
>>>>>       educated decision on the required amount of guards that prevent
>>>>>       unwanted side effects from happening after a move to SPMD mode.
>>>>>     - At this point we can more easily introduce different schemes to
>>>>>       avoid side effects by threads that were not supposed to run. We
>>>>>       can decide if a state machine is needed, conditionals should be
>>>>>       employed, masked instructions are appropriate, or "dummy" local
>>>>>       storage can be used to hide the side effect from the outside
>>>>>       world.
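
As a rough illustration of modification 1), the following host-side C sketch simulates one team sequentially. Everything in it is hypothetical: `rt_kernel_init`, `rt_parallel`, `rt_worker_loop`, and the `team_rt` struct are invented names, not the interface from the patches, and in this simplification the master only dispatches work and does not take part in the parallel region. The point is only that the "user code" in `run_kernel` is structurally identical for both modes and that the worker state machine lives entirely behind the runtime calls.

```c
#include <assert.h>
#include <stdbool.h>

enum { TEAM_SIZE = 4, MAX_WORK = 8 };

typedef void (*work_fn)(int tid, int *out);

/* Hypothetical per-team runtime state: parallel regions posted by the master. */
typedef struct {
  work_fn queue[MAX_WORK];
  int posted;
} team_rt;

/* Returns true if the calling thread leaves the runtime and runs user code.
   SPMD mode: every thread. Guarded mode: only thread 0 (the master). */
static bool rt_kernel_init(int tid, bool spmd) { return spmd || tid == 0; }

/* Parallel region entry. SPMD mode: the caller runs the body itself.
   Guarded mode: the master posts the body for the idling workers. */
static void rt_parallel(team_rt *rt, bool spmd, int tid, work_fn fn, int *out) {
  if (spmd) {
    fn(tid, out);
  } else {
    assert(rt->posted < MAX_WORK);
    rt->queue[rt->posted++] = fn;
  }
}

/* The state machine, hidden inside the runtime: a worker executes every
   posted region and then terminates. */
static void rt_worker_loop(const team_rt *rt, int tid, int *out) {
  for (int i = 0; i < rt->posted; ++i)
    rt->queue[i](tid, out);
}

static void body(int tid, int *out) { out[tid] = 100 + tid; }

/* Sequential simulation of one team running the same kernel in either mode. */
void run_kernel(bool spmd, int *out) {
  team_rt rt = { .posted = 0 };
  /* Pass 1: threads that left the runtime execute the user code. */
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    if (rt_kernel_init(tid, spmd))
      rt_parallel(&rt, spmd, tid, body, out);
  /* Pass 2: threads that stayed in the runtime serve the posted regions. */
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    if (!rt_kernel_init(tid, spmd))
      rt_worker_loop(&rt, tid, out);
}
```

Switching modes only changes the `spmd` argument handed to the runtime, which is what would make a later analysis and conversion comparatively simple.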
>>>>>
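
The "lightweight guards" from modification 2) can be pictured with another sequential host-side simulation in C. This is a sketch under stated assumptions: the struct, the function names, and the choice of a simple `tid == 0` conditional as the guard are all illustrative, and a real GPU implementation would need an actual team-wide barrier where the comment indicates one.

```c
#include <assert.h>
#include <stddef.h>

enum { TEAM_SIZE = 4 };

typedef struct {
  int counter;         /* shared; written by the sequential part */
  int data[TEAM_SIZE]; /* one slot per thread; written by the parallel part */
} team_state;

/* Guarded mode: the master runs the sequential part once, then the
   parallel part runs for every thread of the team. */
void run_guarded(team_state *s) {
  assert(s != NULL);
  s->counter += 1;
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    s->data[tid] = s->counter * 10 + tid;
}

/* Naive SPMD: every thread also executes the sequential part, so its
   side effect is repeated TEAM_SIZE times. This is the unwanted case. */
void run_spmd_unguarded(team_state *s) {
  assert(s != NULL);
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    s->counter += 1;
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    s->data[tid] = s->counter * 10 + tid;
}

/* SPMD with a lightweight guard: all threads run the whole region, but
   a conditional keeps the side effect to a single thread. */
void run_spmd_guarded(team_state *s) {
  assert(s != NULL);
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    if (tid == 0)
      s->counter += 1;
  /* A team-wide barrier would go here so that every thread observes
     the guarded store before entering the parallel part. */
  for (int tid = 0; tid < TEAM_SIZE; ++tid)
    s->data[tid] = s->counter * 10 + tid;
}
```

The guarded SPMD variant reproduces the result of the guarded mode without a worker state machine, while the unguarded variant shows the kind of duplicated side effect the pass would have to detect and prevent.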
>>>>>
>>>>> None of this has been implemented yet but we plan to start in the
>>>>> immediate future. Any comments, ideas, and criticism are welcome!
>>>>>
>>>>>
>>>>> Cheers,
>>>>>  Johannes
>>>>>
>>>>>
>>>>> P.S. [2-4] provide further information on implementation and features.
>>>>>
>>>>> [1] https://ieeexplore.ieee.org/document/7069297
>>>>> [2] https://dl.acm.org/citation.cfm?id=2833161
>>>>> [3] https://dl.acm.org/citation.cfm?id=3018870
>>>>> [4] https://dl.acm.org/citation.cfm?id=3148189
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Johannes Doerfert
>>>>> Researcher
>>>>>
>>>>> Argonne National Laboratory
>>>>> Lemont, IL 60439, USA
>>>>>
>>>>> jdoerfert at anl.gov
>
>
>
>
> --
>
> Johannes Doerfert
> Researcher
>
> Argonne National Laboratory
> Lemont, IL 60439, USA
>
> jdoerfert at anl.gov