[llvm-dev] [RFC] IR-level Region Annotations

Fri Jan 20 10:54:07 PST 2017

> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> 
> Sanjoy, the IR would be like something below. It is ok to hoist alloca instruction outside the region. There are some small changes in optimizer to understand region-annotation intrinsic.   
> 
> { void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 
> With above representation, we can do privatization and outlining as below
> 
> { void main() {
>  i32* val = alloca i32
>  i32* I = alloca 32
>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
> 
>  %ii = omp_get_thread_num();
>  compute_something_into_val(%val, %i);
>  a[i] = %val;
> 
>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
> }
> 

Here we come to the interesting part: the hoisting of  "i32* I = alloca 32” above the intrinsics required to update the intrinsics information “QUAL.PRIVATE”. 
This means that the optimizer has to be aware of it, I’m missing the magic here?
I understand that an openmp specific optimization can do it, the question is how it an openmp agnostic supposed to behave in face of llvm.experimental.intrinsic_a?

— 
Mehdi

> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and replace all  %val with %prv_val in the region. 
> 2. perform outlining. 
> 
> Caller code
> ....
> omp_push_num_threads(4)
> omp_fork_call( .... outline_par_region....) ....
> 
> Callee code: 
> Outlined_par_rgion {
>   I32* priv_val = alloca 32
>   I32* priv_i = ....
> 
>    Ret
> }
> 
> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make sure it runs correctly w/ and w/o optimizations and advanced analysis. So we need to preserve all source information for BE.  
> You can take a look our LLVM-HPC paper for a bit some details.  There are still tons of work to be done. Thanks. 
> 
> Xinmin 
> 
> -----Original Message-----
> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of Sanjoy Das via llvm-dev
> Sent: Thursday, January 19, 2017 10:13 PM
> To: Adve, Vikram Sadanand <vadve at illinois.edu>
> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
> 
> Hi Vikram,
> 
> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand <vadve at illinois.edu> wrote:
>> Hi Sanjoy,
>> 
>> Yes, that's exactly what we have been looking at recently here, but 
>> the region tags seem to make it possible to express the control flow 
>> as well, so I think we could start with reg ions+metadata, as Hal and
> 
> I'm not yet convinced that region tags are sufficient to model exotic control flow.
> 
> (I don't know OpenMP so this is a copy-pasted-edited example)
> 
> Say we have:
> 
> void main() {
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val;
>    compute_something_into_val(&val, i);
>    a[i] = val;
>  }
> }
> 
> I presume the (eventual) intended lowering is something like this (if the intended lowering is different than this, and avoids the issue I'm trying to highlight then my point is moot):
> 
> void main() {
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  i32* val = alloca i32
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> However, LLVM is free to hoist the alloca to the entry block:
> 
> void main() {
>  i32* val = alloca i32
>  tok = llvm.experimental.intrinsic_a();
> 
>  int i = omp_get_thread_num();
>  compute_something_into_val(val, i);
>  a[i] = val;
> 
>  llvm.experimental.intrinsic_b(tok);
> }
> 
> and now you have a race between the four parallel forks.
> 
> The problem here is that nothing in the IR expresses that we have four copies of the region running "at the same time".  In fact, such a control flow is alien to LLVM today.
> 
> For instance, another evil optimization may turn:
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return a[0] + a[1];
> }
> 
> to
> 
> void main() {
>  int a[4];
>  #pragma omp parallel num_threads(4)
>  {
>    int i = omp_get_thread_num();
>    int val = compute_something_into_val(i);
>    a[i] = val;
>  }
> 
>  return undef;
> }
> 
> since a[i] = val could have initialized at most one element in a.
> 
> Now you could say that the llvm.experimental.intrinsic_a and llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious" optimizations are not allowed to happen across them; but then calls to these intrinsics is pretty fundamentally different from "normal" calls, and are probably best modeled as new instructions.
> You're going to have to do the same kind of auditing of passes either way, and the only extra cost of a new instruction is the extra bitcode reading / writing code.
> 
> I hope I made sense.
> 
> -- Sanjoy
> 
>> Xinmin proposed, and then figure out what needs to be first class 
>> instructions.
> 
>> 
>> --Vikram Adve
>> 
>> 
>> 
>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
>>> 
>>> Hi,
>>> 
>>> My bias is to use both (b) and (d), since they have complementary 
>>> strengths.  We should use (b) for expressing concepts that can't be 
>>> semantically modeled as a call or invoke (this branch takes both its 
>>> successors), and (d) for expressing things that can be (this call may 
>>> never return), and annotation like things (this region (denoted by 
>>> def-use of a token) is a reduction).
>>> 
>>> I don't grok OpenMP, but perhaps we can come with one or two 
>>> "generalized control flow"-type instructions that can be used to 
>>> model the non-call/invoke like semantics we'd like LLVM to know 
>>> about, and model the rest with (d)?
>>> 
>>> -- Sanjoy
>>> 
>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>> <llvm-dev at lists.llvm.org> wrote:
>>>> 
>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>> 
>>>> 
>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com> wrote:
>>>>> 
>>>>> 
>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>> 
>>>>> 
>>>>> 
>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>> 
>>>>>> 
>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev 
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>> 
>>>>>>> Hi Johannes,
>>>>>>> 
>>>>>>>> I am especially curious where you get your data from. Tapir [0] 
>>>>>>>> (and to some degree PIR [1]) have shown that, 
>>>>>>>> counterintuitively, only a few changes to LLVM passes are 
>>>>>>>> needed. Tapir was recently used in an MIT class with a lot of 
>>>>>>>> students and it seemed to work well with only minimal changes to 
>>>>>>>> analysis and especially transformation passes.
>>>>>>> 
>>>>>>> TAPIR is an elegant, small extension and, in particular, I think 
>>>>>>> the idea of asymmetric parallel tasks and control flow is a 
>>>>>>> clever way to express parallelism with serial semantics, as in 
>>>>>>> Cilk.  Encoding the control flow extensions as explicit 
>>>>>>> instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.
>>>>>>> 
>>>>>>> However, Cilk is a tiny language compared with the full 
>>>>>>> complexity of other languages, like OpenMP.  To take just one 
>>>>>>> example, TAPIR cannot express the ORDERED construct of OpenMP.  A 
>>>>>>> more serious concern, IMO, is that TAPIR (like Cilk) requires 
>>>>>>> serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction.
>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, 
>>>>>>> that are needed because without that, you’d be dependent on 
>>>>>>> fundamentally hard compiler analyses to extract the same 
>>>>>>> information for satisfactory parallel performance; realistic 
>>>>>>> applications cannot depend on the success of such analyses.
>>>>>> 
>>>>>> I agree with this, but I’m also wondering if it needs to be first 
>>>>>> class in the IR?
>>>>>> For example we know our alias analysis is very basic, and C/C++ 
>>>>>> have a higher constraint thanks to their type system, but we 
>>>>>> didn’t inject this higher level information that helps the 
>>>>>> optimizer as first class IR constructs.
>>>>> 
>>>>> 
>>>>> FWIW, while i agree with the general point, i wouldn't use this example.
>>>>> Because we pretty much still suffer to this day because of it (both 
>>>>> in AA, and devirt, and ...)  :) We can't always even tell fields 
>>>>> apart
>>>>> 
>>>>> 
>>>>> Is it inherent to the infrastructure, i.e. using metadata instead 
>>>>> of first class IR construct or is it just a “quality of implementation” issue?
>>>> 
>>>> Not to derail this conversation:
>>>> 
>>>> IMHO, At some point there is no real difference :)
>>>> 
>>>> Because otherwise, everything is a QOI issue.
>>>> 
>>>> IE if it's super tricky to get metadata that works well and works 
>>>> right, doesn't get lost, etc, and that's inherent to using metadata, 
>>>> that to me is not a QOI issue.
>>>> 
>>>> So could it be done with metadata? Probably?
>>>> But at the same time,  if it had been done with more first class 
>>>> constructs, it would have happened years ago  and been much lower cost.
>>>> 
>>>> 
>>>> This is what I meant by “inherent to the infrastructure”, thanks for 
>>>> clarifying.
>>>> 
>>>> 
>>>> To clarify, we were proposing metadata that is used as arguments to 
>>>> the region-annotation intrinsics. This metadata has the nice 
>>>> property that it does not get dropped (so it is just being used as a 
>>>> way of encoding whatever data structures are necessary without predefining a syntactic schema).
>>>> 
>>>> -Hal
>>>> 
>>>> 
>>>> —
>>>> Mehdi
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
>>>> 
>>>> --
>>>> Hal Finkel
>>>> Lead, Compiler Technology and Programming Languages Leadership 
>>>> Computing Facility Argonne National Laboratory
>>>> 
>>>> 
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> llvm-dev at lists.llvm.org
>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>> 
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> _______________________________________________
> LLVM Developers mailing list
> llvm-dev at lists.llvm.org
> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev