[llvm-dev] [RFC] IR-level Region Annotations
Mehdi Amini via llvm-dev
llvm-dev at lists.llvm.org
Sat Jan 21 14:04:49 PST 2017
I missed this before answering my last email about how "as is” it does not sound like something that can scale non-intrusively in LLVM as a model-agnostic and general-purpose region semantic.
Sanjoy explains it just better :), so right now my understanding is the same as what Sanjoy expressed below.
—
Mehdi
> On Jan 20, 2017, at 11:35 AM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>
> Hi,
>
> I'm going to club together some responses.
>
> I agree that outlining function sub-bodies and passing in the function
> pointers to said outlined bodies to OpenMP helpers lets us correctly
> implement the semantics we need. However, unless I severely
> misunderstood the thread, I thought the key idea was to move *away*
> from that representation and towards a representation that _allows_
> optimization?
>
> My problem with representing parallel regions with
> intrinsic-denoted-regions is that we're lying to the optimizer about
> what the code actually does. Calls, even to intrinsics, can "at
> worst" do some combination of the following:
>
> - Write to and read from arbitrary memory
> - Have UB (but we're allowed to pretend that they don't)
> - Throw an exception
> - Never return, either by infinite looping or by calling exit(0)
> - Have memory synchronization operations, like fences, atomic loads,
> stores etc.
> - Have side effects like IO, volatile writes
>
> If an intrinsic's behavior can be explained by some subset of the
> above, then you should not need to edit any pass to preserve
> _correctness_ -- all optimization passes (today) conservatively assume
> that calls that they don't understand have all of the behaviors
> outlined above.
>
> However, if to preserve *correctness* you have to edit optimization
> passes and teach them that certain intrinsic calls have behavior
> *outside* the set mentioned above then the instruction really does not
> have "call semantics". `call @llvm.experimental.region_begin()` is
> really a fundamentally new instruction masquerading as an intrinsic,
> and it is probably better to call a spade a spade and represent it as
> a new instruction.
>
> The setting for the examples I gave was not that "here is a case we
> need to get right". The setting was that "here is a *symptom* that
> shows that we've lied to the optimizer". We can go ahead and fix all
> the symptoms by adding bailouts to the respective passes, but that
> does not make us immune to passes that we don't know about
> e.g. downstream passes, passes that will be added later. It also puts
> us in a weird spot around semantics of call instructions.
>
> -- Sanjoy
>
> On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Yonghong, In our implementation (not open sourced), we don’t do outlining
>> the Front-End. See my previous reply to Medhi’s email.
>>
>>
>>
>> Xinmin
>>
>>
>>
>> From: Yonghong Yan [mailto:yan at oakland.edu]
>> Sent: Friday, January 20, 2017 11:18 AM
>> To: Mehdi Amini
>> Cc: Tian, Xinmin; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org;
>> Adve, Vikram Sadanand
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>>
>>
>> Xinmin,
>>
>>
>>
>> outlining turns a parallel program into a sequential one from compiler's
>> perspective, and that is why most of the parallel-ignorant pass would hurt.
>> In your IR description for Sanjoy's example, does the current approach of
>> outlining impacting the way of the IR should be enhanced for parallelism?
>>
>>
>>
>> For that specific example (or other analysis and optimization SPMD) and what
>> is implemented in clang, I am not sure whether we are going to change the
>> frontend so not to outline the parallel region, or allow to perform certain
>> optimization such as hoisting that alloca in clang which is not desired I
>> believe. Or annotate the outlined function together with the intrinsic_a so
>> that hoisting can be performed, in which case the instrisic_a would like
>> this:
>>
>>
>>
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32*
>> %val, i32 %i), "QUAL.NUM_THREADS"(i32 4) plus info for OUTLINED_call.
>>
>>
>>
>> Mehdi,
>>
>>
>>
>> I think i am asking the same question as you asked me.
>>
>>
>>
>> Yonghong
>>
>>
>>
>>
>>
>>
>>
>> On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>>
>>
>>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>
>>> Sanjoy, the IR would be like something below. It is ok to hoist alloca
>>> instruction outside the region. There are some small changes in optimizer to
>>> understand region-annotation intrinsic.
>>>
>>> { void main() {
>>> i32* val = alloca i32
>>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>>>
>>> int i = omp_get_thread_num();
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>>
>>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>> }
>>>
>>> With above representation, we can do privatization and outlining as below
>>>
>>> { void main() {
>>> i32* val = alloca i32
>>> i32* I = alloca 32
>>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>> "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>>>
>>> %ii = omp_get_thread_num();
>>> compute_something_into_val(%val, %i);
>>> a[i] = %val;
>>>
>>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>> }
>>>
>>
>> Here we come to the interesting part: the hoisting of "i32* I = alloca 32”
>> above the intrinsics required to update the intrinsics information
>> “QUAL.PRIVATE”.
>> This means that the optimizer has to be aware of it, I’m missing the magic
>> here?
>> I understand that an openmp specific optimization can do it, the question is
>> how it an openmp agnostic supposed to behave in face of
>> llvm.experimental.intrinsic_a?
>>
>> —
>> Mehdi
>>
>>
>>
>>
>>> 1. create i32* priv_val = alloca i32 %priv_i = ...in the region, and
>>> replace all %val with %prv_val in the region.
>>> 2. perform outlining.
>>>
>>> Caller code
>>> ....
>>> omp_push_num_threads(4)
>>> omp_fork_call( .... outline_par_region....) ....
>>>
>>> Callee code:
>>> Outlined_par_rgion {
>>> I32* priv_val = alloca 32
>>> I32* priv_i = ....
>>>
>>> Ret
>>> }
>>>
>>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make
>>> sure it runs correctly w/ and w/o optimizations and advanced analysis. So we
>>> need to preserve all source information for BE.
>>> You can take a look our LLVM-HPC paper for a bit some details. There are
>>> still tons of work to be done. Thanks.
>>>
>>> Xinmin
>>>
>>> -----Original Message-----
>>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>>> Sanjoy Das via llvm-dev
>>> Sent: Thursday, January 19, 2017 10:13 PM
>>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>>> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>>
>>> Hi Vikram,
>>>
>>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
>>> <vadve at illinois.edu> wrote:
>>>> Hi Sanjoy,
>>>>
>>>> Yes, that's exactly what we have been looking at recently here, but
>>>> the region tags seem to make it possible to express the control flow
>>>> as well, so I think we could start with reg ions+metadata, as Hal and
>>>
>>> I'm not yet convinced that region tags are sufficient to model exotic
>>> control flow.
>>>
>>> (I don't know OpenMP so this is a copy-pasted-edited example)
>>>
>>> Say we have:
>>>
>>> void main() {
>>> #pragma omp parallel num_threads(4)
>>> {
>>> int i = omp_get_thread_num();
>>> int val;
>>> compute_something_into_val(&val, i);
>>> a[i] = val;
>>> }
>>> }
>>>
>>> I presume the (eventual) intended lowering is something like this (if the
>>> intended lowering is different than this, and avoids the issue I'm trying to
>>> highlight then my point is moot):
>>>
>>> void main() {
>>> tok = llvm.experimental.intrinsic_a();
>>>
>>> int i = omp_get_thread_num();
>>> i32* val = alloca i32
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>>
>>> llvm.experimental.intrinsic_b(tok);
>>> }
>>>
>>> However, LLVM is free to hoist the alloca to the entry block:
>>>
>>> void main() {
>>> i32* val = alloca i32
>>> tok = llvm.experimental.intrinsic_a();
>>>
>>> int i = omp_get_thread_num();
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>>
>>> llvm.experimental.intrinsic_b(tok);
>>> }
>>>
>>> and now you have a race between the four parallel forks.
>>>
>>> The problem here is that nothing in the IR expresses that we have four
>>> copies of the region running "at the same time". In fact, such a control
>>> flow is alien to LLVM today.
>>>
>>> For instance, another evil optimization may turn:
>>>
>>> void main() {
>>> int a[4];
>>> #pragma omp parallel num_threads(4)
>>> {
>>> int i = omp_get_thread_num();
>>> int val = compute_something_into_val(i);
>>> a[i] = val;
>>> }
>>>
>>> return a[0] + a[1];
>>> }
>>>
>>> to
>>>
>>> void main() {
>>> int a[4];
>>> #pragma omp parallel num_threads(4)
>>> {
>>> int i = omp_get_thread_num();
>>> int val = compute_something_into_val(i);
>>> a[i] = val;
>>> }
>>>
>>> return undef;
>>> }
>>>
>>> since a[i] = val could have initialized at most one element in a.
>>>
>>> Now you could say that the llvm.experimental.intrinsic_a and
>>> llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious"
>>> optimizations are not allowed to happen across them; but then calls to these
>>> intrinsics is pretty fundamentally different from "normal" calls, and are
>>> probably best modeled as new instructions.
>>> You're going to have to do the same kind of auditing of passes either way,
>>> and the only extra cost of a new instruction is the extra bitcode reading /
>>> writing code.
>>>
>>> I hope I made sense.
>>>
>>> -- Sanjoy
>>>
>>>> Xinmin proposed, and then figure out what needs to be first class
>>>> instructions.
>>>
>>>>
>>>> --Vikram Adve
>>>>
>>>>
>>>>
>>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das
>>>>> <sanjoy at playingwithpointers.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> My bias is to use both (b) and (d), since they have complementary
>>>>> strengths. We should use (b) for expressing concepts that can't be
>>>>> semantically modeled as a call or invoke (this branch takes both its
>>>>> successors), and (d) for expressing things that can be (this call may
>>>>> never return), and annotation like things (this region (denoted by
>>>>> def-use of a token) is a reduction).
>>>>>
>>>>> I don't grok OpenMP, but perhaps we can come with one or two
>>>>> "generalized control flow"-type instructions that can be used to
>>>>> model the non-call/invoke like semantics we'd like LLVM to know
>>>>> about, and model the rest with (d)?
>>>>>
>>>>> -- Sanjoy
>>>>>
>>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>>>
>>>>>>
>>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev
>>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>>
>>>>>>>>> Hi Johannes,
>>>>>>>>>
>>>>>>>>>> I am especially curious where you get your data from. Tapir [0]
>>>>>>>>>> (and to some degree PIR [1]) have shown that,
>>>>>>>>>> counterintuitively, only a few changes to LLVM passes are
>>>>>>>>>> needed. Tapir was recently used in an MIT class with a lot of
>>>>>>>>>> students and it seemed to work well with only minimal changes to
>>>>>>>>>> analysis and especially transformation passes.
>>>>>>>>>
>>>>>>>>> TAPIR is an elegant, small extension and, in particular, I think
>>>>>>>>> the idea of asymmetric parallel tasks and control flow is a
>>>>>>>>> clever way to express parallelism with serial semantics, as in
>>>>>>>>> Cilk. Encoding the control flow extensions as explicit
>>>>>>>>> instructions is orthogonal to that, though arguably more elegant
>>>>>>>>> than using region tags + metadata.
>>>>>>>>>
>>>>>>>>> However, Cilk is a tiny language compared with the full
>>>>>>>>> complexity of other languages, like OpenMP. To take just one
>>>>>>>>> example, TAPIR cannot express the ORDERED construct of OpenMP. A
>>>>>>>>> more serious concern, IMO, is that TAPIR (like Cilk) requires
>>>>>>>>> serial semantics, whereas there are many parallel languages, OpenMP
>>>>>>>>> included, that do not obey that restriction.
>>>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
>>>>>>>>> that are needed because without that, you’d be dependent on
>>>>>>>>> fundamentally hard compiler analyses to extract the same
>>>>>>>>> information for satisfactory parallel performance; realistic
>>>>>>>>> applications cannot depend on the success of such analyses.
>>>>>>>>
>>>>>>>> I agree with this, but I’m also wondering if it needs to be first
>>>>>>>> class in the IR?
>>>>>>>> For example we know our alias analysis is very basic, and C/C++
>>>>>>>> have a higher constraint thanks to their type system, but we
>>>>>>>> didn’t inject this higher level information that helps the
>>>>>>>> optimizer as first class IR constructs.
>>>>>>>
>>>>>>>
>>>>>>> FWIW, while i agree with the general point, i wouldn't use this
>>>>>>> example.
>>>>>>> Because we pretty much still suffer to this day because of it (both
>>>>>>> in AA, and devirt, and ...) :) We can't always even tell fields
>>>>>>> apart
>>>>>>>
>>>>>>>
>>>>>>> Is it inherent to the infrastructure, i.e. using metadata instead
>>>>>>> of first class IR construct or is it just a “quality of
>>>>>>> implementation” issue?
>>>>>>
>>>>>> Not to derail this conversation:
>>>>>>
>>>>>> IMHO, At some point there is no real difference :)
>>>>>>
>>>>>> Because otherwise, everything is a QOI issue.
>>>>>>
>>>>>> IE if it's super tricky to get metadata that works well and works
>>>>>> right, doesn't get lost, etc, and that's inherent to using metadata,
>>>>>> that to me is not a QOI issue.
>>>>>>
>>>>>> So could it be done with metadata? Probably?
>>>>>> But at the same time, if it had been done with more first class
>>>>>> constructs, it would have happened years ago and been much lower cost.
>>>>>>
>>>>>>
>>>>>> This is what I meant by “inherent to the infrastructure”, thanks for
>>>>>> clarifying.
>>>>>>
>>>>>>
>>>>>> To clarify, we were proposing metadata that is used as arguments to
>>>>>> the region-annotation intrinsics. This metadata has the nice
>>>>>> property that it does not get dropped (so it is just being used as a
>>>>>> way of encoding whatever data structures are necessary without
>>>>>> predefining a syntactic schema).
>>>>>>
>>>>>> -Hal
>>>>>>
>>>>>>
>>>>>> —
>>>>>> Mehdi
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Hal Finkel
>>>>>> Lead, Compiler Technology and Programming Languages Leadership
>>>>>> Computing Facility Argonne National Laboratory
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
>>
>>
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>
More information about the llvm-dev
mailing list