[llvm-dev] [RFC] IR-level Region Annotations
Mehdi Amini via llvm-dev
llvm-dev at lists.llvm.org
Fri Jan 20 10:54:07 PST 2017
> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
>
> Sanjoy, the IR would be like something below. It is ok to hoist alloca instruction outside the region. There are some small changes in optimizer to understand region-annotation intrinsic.
>
> { void main() {
> i32* val = alloca i32
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>
> int i = omp_get_thread_num();
> compute_something_into_val(val, i);
> a[i] = val;
>
> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
> With above representation, we can do privatization and outlining as below
>
> { void main() {
> i32* val = alloca i32
> i32* I = alloca 32
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>
> %ii = omp_get_thread_num();
> compute_something_into_val(%val, %i);
> a[i] = %val;
>
> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
>
Here we come to the interesting part: the hoisting of "i32* I = alloca 32” above the intrinsics required to update the intrinsics information “QUAL.PRIVATE”.
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how it an openmp agnostic supposed to behave in face of llvm.experimental.intrinsic_a?
—
Mehdi
> 1. create i32* priv_val = alloca i32 %priv_i = ...in the region, and replace all %val with %prv_val in the region.
> 2. perform outlining.
>
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
>
> Callee code:
> Outlined_par_rgion {
> I32* priv_val = alloca 32
> I32* priv_i = ....
>
> Ret
> }
>
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make sure it runs correctly w/ and w/o optimizations and advanced analysis. So we need to preserve all source information for BE.
> You can take a look our LLVM-HPC paper for a bit some details. There are still tons of work to be done. Thanks.
>
> Xinmin
>
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
> Hi Vikram,
>
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at illinois.edu> wrote:
>> Hi Sanjoy,
>>
>> Yes, that's exactly what we have been looking at recently here, but
>> the region tags seem to make it possible to express the control flow
>> as well, so I think we could start with reg ions+metadata, as Hal and
>
> I'm not yet convinced that region tags are sufficient to model exotic control flow.
>
> (I don't know OpenMP so this is a copy-pasted-edited example)
>
> Say we have:
>
> void main() {
> #pragma omp parallel num_threads(4)
> {
> int i = omp_get_thread_num();
> int val;
> compute_something_into_val(&val, i);
> a[i] = val;
> }
> }
>
> I presume the (eventual) intended lowering is something like this (if the intended lowering is different than this, and avoids the issue I'm trying to highlight then my point is moot):
>
> void main() {
> tok = llvm.experimental.intrinsic_a();
>
> int i = omp_get_thread_num();
> i32* val = alloca i32
> compute_something_into_val(val, i);
> a[i] = val;
>
> llvm.experimental.intrinsic_b(tok);
> }
>
> However, LLVM is free to hoist the alloca to the entry block:
>
> void main() {
> i32* val = alloca i32
> tok = llvm.experimental.intrinsic_a();
>
> int i = omp_get_thread_num();
> compute_something_into_val(val, i);
> a[i] = val;
>
> llvm.experimental.intrinsic_b(tok);
> }
>
> and now you have a race between the four parallel forks.
>
> The problem here is that nothing in the IR expresses that we have four copies of the region running "at the same time". In fact, such a control flow is alien to LLVM today.
>
> For instance, another evil optimization may turn:
>
> void main() {
> int a[4];
> #pragma omp parallel num_threads(4)
> {
> int i = omp_get_thread_num();
> int val = compute_something_into_val(i);
> a[i] = val;
> }
>
> return a[0] + a[1];
> }
>
> to
>
> void main() {
> int a[4];
> #pragma omp parallel num_threads(4)
> {
> int i = omp_get_thread_num();
> int val = compute_something_into_val(i);
> a[i] = val;
> }
>
> return undef;
> }
>
> since a[i] = val could have initialized at most one element in a.
>
> Now you could say that the llvm.experimental.intrinsic_a and llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious" optimizations are not allowed to happen across them; but then calls to these intrinsics is pretty fundamentally different from "normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either way, and the only extra cost of a new instruction is the extra bitcode reading / writing code.
>
> I hope I made sense.
>
> -- Sanjoy
>
>> Xinmin proposed, and then figure out what needs to be first class
>> instructions.
>
>>
>> --Vikram Adve
>>
>>
>>
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>>>
>>> Hi,
>>>
>>> My bias is to use both (b) and (d), since they have complementary
>>> strengths. We should use (b) for expressing concepts that can't be
>>> semantically modeled as a call or invoke (this branch takes both its
>>> successors), and (d) for expressing things that can be (this call may
>>> never return), and annotation like things (this region (denoted by
>>> def-use of a token) is a reduction).
>>>
>>> I don't grok OpenMP, but perhaps we can come with one or two
>>> "generalized control flow"-type instructions that can be used to
>>> model the non-call/invoke like semantics we'd like LLVM to know
>>> about, and model the rest with (d)?
>>>
>>> -- Sanjoy
>>>
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>>>
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>
>>>>
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>
>>>>
>>>>
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>>>
>>>>>
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>> Hi Johannes,
>>>>>>>
>>>>>>>> I am especially curious where you get your data from. Tapir [0]
>>>>>>>> (and to some degree PIR [1]) have shown that,
>>>>>>>> counterintuitively, only a few changes to LLVM passes are
>>>>>>>> needed. Tapir was recently used in an MIT class with a lot of
>>>>>>>> students and it seemed to work well with only minimal changes to
>>>>>>>> analysis and especially transformation passes.
>>>>>>>
>>>>>>> TAPIR is an elegant, small extension and, in particular, I think
>>>>>>> the idea of asymmetric parallel tasks and control flow is a
>>>>>>> clever way to express parallelism with serial semantics, as in
>>>>>>> Cilk. Encoding the control flow extensions as explicit
>>>>>>> instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>>>>>>>
>>>>>>> However, Cilk is a tiny language compared with the full
>>>>>>> complexity of other languages, like OpenMP. To take just one
>>>>>>> example, TAPIR cannot express the ORDERED construct of OpenMP. A
>>>>>>> more serious concern, IMO, is that TAPIR (like Cilk) requires
>>>>>>> serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
>>>>>>> that are needed because without that, you’d be dependent on
>>>>>>> fundamentally hard compiler analyses to extract the same
>>>>>>> information for satisfactory parallel performance; realistic
>>>>>>> applications cannot depend on the success of such analyses.
>>>>>>
>>>>>> I agree with this, but I’m also wondering if it needs to be first
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic, and C/C++
>>>>>> have a higher constraint thanks to their type system, but we
>>>>>> didn’t inject this higher level information that helps the
>>>>>> optimizer as first class IR constructs.
>>>>>
>>>>>
>>>>> FWIW, while i agree with the general point, i wouldn't use this example.
>>>>> Because we pretty much still suffer to this day because of it (both
>>>>> in AA, and devirt, and ...) :) We can't always even tell fields
>>>>> apart
>>>>>
>>>>>
>>>>> Is it inherent to the infrastructure, i.e. using metadata instead
>>>>> of first class IR construct or is it just a “quality of implementation” issue?
>>>>
>>>> Not to derail this conversation:
>>>>
>>>> IMHO, At some point there is no real difference :)
>>>>
>>>> Because otherwise, everything is a QOI issue.
>>>>
>>>> IE if it's super tricky to get metadata that works well and works
>>>> right, doesn't get lost, etc, and that's inherent to using metadata,
>>>> that to me is not a QOI issue.
>>>>
>>>> So could it be done with metadata? Probably?
>>>> But at the same time, if it had been done with more first class
>>>> constructs, it would have happened years ago and been much lower cost.
>>>>
>>>>
>>>> This is what I meant by “inherent to the infrastructure”, thanks for
>>>> clarifying.
>>>>
>>>>
>>>> To clarify, we were proposing metadata that is used as arguments to
>>>> the region-annotation intrinsics. This metadata has the nice
>>>> property that it does not get dropped (so it is just being used as a
>>>> way of encoding whatever data structures are necessary without predefining a syntactic schema).
>>>>
>>>> -Hal
>>>>
>>>>
>>>> —
>>>> Mehdi
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
>>>>
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership
>>>> Computing Facility Argonne National Laboratory
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
More information about the llvm-dev
mailing list