[llvm-dev] [RFC] IR-level Region Annotations

Tian, Xinmin via llvm-dev llvm-dev at lists.llvm.org
Tue Jan 31 17:26:20 PST 2017


[XT] Back from business trips and trying to catch up with the discussion.

>>>>I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need.  However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?

[XT]: Your understanding is correct. However, the IR-level region annotation RFC is not just for OpenMP; OpenMP is one of its use cases.

>>>>My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does.  Calls, even to intrinsics, can "at worst" do some combination of the following:

 - Write to and read from arbitrary memory
 - Have UB (but we're allowed to pretend that they don't)
 - Throw an exception
 - Never return, either by infinite looping or by calling exit(0)
 - Have memory synchronization operations, like fences, atomic loads,
   stores etc.
 - Have side effects like IO, volatile writes

[XT] Based on Google's and Xilinx's suggestions, the IR-level region annotation can use tokens and tags with intrinsic functions to model regions and memory dependencies (use/def). The items in the list above are handled according to language rules. For example, the OpenMP rules say that throwing an exception inside a parallel region is allowed, but it must be caught within the region, i.e. no control-flow edge is allowed to cross the region boundary. "exit" is the one exception that is allowed, as it terminates the program. Our solution is to have the FE and/or one central place in the ME deal with language specifics.
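
As a rough illustration of the token pairing described above, the following sketch checks that every region-end marker consumes the token of the innermost open region-begin marker. It works on a flat instruction list only; the `Inst` struct, the `Kind` enum and `regionsWellNested` are hypothetical names made up for this example and are not part of LLVM or of the proposal. It also does not model control-flow edges, only the begin/end matching.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a straight-line IR instruction stream.
enum class Kind { RegionBegin, RegionEnd, Other };

struct Inst {
    Kind kind;
    int token;        // token defined by RegionBegin and used by RegionEnd; -1 otherwise
    std::string tag;  // e.g. "DIR.PARALLEL"; empty for ordinary instructions
};

// Returns true if every RegionEnd uses the token of the innermost open
// RegionBegin and no region is left open at the end of the stream.
bool regionsWellNested(const std::vector<Inst>& insts) {
    std::vector<int> open;  // stack of tokens for the currently open regions
    for (const Inst& inst : insts) {
        if (inst.kind == Kind::RegionBegin) {
            open.push_back(inst.token);
        } else if (inst.kind == Kind::RegionEnd) {
            if (open.empty() || open.back() != inst.token) return false;
            open.pop_back();
        }
    }
    return open.empty();
}
```

A check along these lines could live in the "central place" mentioned above, but where it would actually run is an assumption of this sketch.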

>>>>However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not have "call semantics".  `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.

[XT] Yes and no. Yes with respect to region scope annotation. No in that it is more than one new instruction; it is more like a sequence of instructions. Assume we have a "fork" instruction: OpenMP fork and Cilk fork/spawn semantics differ in terms of stack frame allocation and ABI. When we introduce a new instruction, its exact semantics need to be defined and cannot be altered afterwards. Thus, we proposed to start with experimental intrinsics, and this has proven to work. We can always convert the intrinsics with tokens/tags to instructions once we have enough solid cases and justification for the model-agnostic part of the conversion.
   
>>>>>The setting for the examples I gave was not that "here is a case we need to get right".  The setting was that "here is a *symptom* that shows that we've lied to the optimizer".  We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later.  It also puts us in a weird spot around semantics of call instructions.

[XT] I would say it is a design trade-off between having a central place to deal with the specifics and making drastic changes from day one. Our process is to have a central place to get everything working, and then to turn off the support for some "symptoms" in this central place one by one to trigger downstream failures and fix them. I think our ultimate goal is more or less the same; we are just taking a different approach to get there. The central place / prepare phase for getting the IR into a "canonical" form will help to address the issue you mentioned: "downstream passes, passes that will be added later.  It also puts us in a weird spot around semantics of call instructions."

Thanks for all the questions, discussion, and feedback.

Xinmin 

-----Original Message-----
From: Sanjoy Das [mailto:sanjoy at playingwithpointers.com] 
Sent: Friday, January 20, 2017 11:36 AM
To: Tian, Xinmin <xinmin.tian at intel.com>
Cc: Yonghong Yan <yan at oakland.edu>; Mehdi Amini <mehdi.amini at apple.com>; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org; Adve, Vikram Sadanand <vadve at illinois.edu>
Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations

Hi,

I'm going to club together some responses.

I agree that outlining function sub-bodies and passing in the function pointers to said outlined bodies to OpenMP helpers lets us correctly implement the semantics we need.  However, unless I severely misunderstood the thread, I thought the key idea was to move *away* from that representation and towards a representation that _allows_ optimization?

My problem with representing parallel regions with intrinsic-denoted-regions is that we're lying to the optimizer about what the code actually does.  Calls, even to intrinsics, can "at worst" do some combination of the following:

 - Write to and read from arbitrary memory
 - Have UB (but we're allowed to pretend that they don't)
 - Throw an exception
 - Never return, either by infinite looping or by calling exit(0)
 - Have memory synchronization operations, like fences, atomic loads,
   stores etc.
 - Have side effects like IO, volatile writes

If an intrinsic's behavior can be explained by some subset of the above, then you should not need to edit any pass to preserve _correctness_ -- all optimization passes (today) conservatively assume that calls that they don't understand have all of the behaviors outlined above.
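
To make the conservative assumption concrete, here is a minimal sketch of how a pass might reason about a call it does not understand. The effect bits mirror the list above (UB is left out, since passes may assume it does not happen); the names `Effect`, `kUnknownCall` and `canHoistLoadAboveCall`, and the exact hoisting rule, are assumptions made for this illustration and are not LLVM's actual API or logic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit-set of the "at worst" behaviours a call may have.
enum Effect : std::uint8_t {
    ReadsMemory  = 1 << 0,
    WritesMemory = 1 << 1,
    MayThrow     = 1 << 2,
    MayNotReturn = 1 << 3,
    Synchronizes = 1 << 4,  // fences, atomic loads and stores
    SideEffects  = 1 << 5,  // IO, volatile accesses
};

// A call the pass knows nothing about is assumed to do all of the above.
constexpr std::uint8_t kUnknownCall = ReadsMemory | WritesMemory | MayThrow |
                                      MayNotReturn | Synchronizes | SideEffects;

// Conservative rule for this sketch: an ordinary load that follows the call
// may be moved above it only if the call cannot change memory, cannot
// synchronize with other threads, and is guaranteed to return normally
// (otherwise the load might never have executed in the original program).
bool canHoistLoadAboveCall(std::uint8_t callEffects) {
    const std::uint8_t blockers =
        WritesMemory | MayThrow | MayNotReturn | Synchronizes;
    return (callEffects & blockers) == 0;
}
```

With this model, `canHoistLoadAboveCall(kUnknownCall)` is false, so an unmodified pass stays correct for any intrinsic whose behaviour is a subset of these bits. "The code between two calls runs several times concurrently" is not expressible as one of these bits, which is the gap being pointed out.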

However, if to preserve *correctness* you have to edit optimization passes and teach them that certain intrinsic calls have behavior
*outside* the set mentioned above then the instruction really does not have "call semantics".  `call @llvm.experimental.region_begin()` is really a fundamentally new instruction masquerading as an intrinsic, and it is probably better to call a spade a spade and represent it as a new instruction.

The setting for the examples I gave was not that "here is a case we need to get right".  The setting was that "here is a *symptom* that shows that we've lied to the optimizer".  We can go ahead and fix all the symptoms by adding bailouts to the respective passes, but that does not make us immune to passes that we don't know about e.g. downstream passes, passes that will be added later.  It also puts us in a weird spot around semantics of call instructions.
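
As an illustration of the first symptom (the hoisted alloca from the earlier example in this thread), the sketch below replays one legal interleaving of the four threads without using real threads: all four compute their value before any of them stores it. `computeSomethingIntoVal` is a hypothetical stand-in for `compute_something_into_val`, and the fixed schedule is an assumption chosen to make the outcome deterministic.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for compute_something_into_val from the example.
inline void computeSomethingIntoVal(int* val, int i) { *val = i * 10; }

// Source semantics: each of the four "threads" owns its own val.
std::array<int, 4> runWithPrivateVal() {
    std::array<int, 4> a{};
    std::array<int, 4> val{};  // one slot per thread
    for (int i = 0; i < 4; ++i) computeSomethingIntoVal(&val[i], i);  // all computes first
    for (int i = 0; i < 4; ++i) a[i] = val[i];                        // then all stores
    return a;
}

// After the alloca is hoisted above the region, all threads share one val.
std::array<int, 4> runWithHoistedVal() {
    std::array<int, 4> a{};
    int val = 0;  // single shared slot
    for (int i = 0; i < 4; ++i) computeSomethingIntoVal(&val, i);  // each overwrites the last
    for (int i = 0; i < 4; ++i) a[i] = val;                        // every store sees the final value
    return a;
}
```

Under this schedule the private version yields {0, 10, 20, 30}, while the hoisted version yields {30, 30, 30, 30}. With real threads the hoisted version would simply be a data race.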

-- Sanjoy

On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev <llvm-dev at lists.llvm.org> wrote:
> Yonghong, in our implementation (not open sourced), we don’t do 
> outlining in the front end.  See my previous reply to Mehdi’s email.
>
>
>
> Xinmin
>
>
>
> From: Yonghong Yan [mailto:yan at oakland.edu]
> Sent: Friday, January 20, 2017 11:18 AM
> To: Mehdi Amini
> Cc: Tian, Xinmin; llvm-dev at lists.llvm.org; 
> llvm-dev-request at lists.llvm.org; Adve, Vikram Sadanand
> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>
>
>
> Xinmin,
>
>
>
> Outlining turns a parallel program into a sequential one from the 
> compiler's perspective, and that is why most parallelism-ignorant passes would hurt.
> In your IR description for Sanjoy's example, does the current approach 
> of outlining affect how the IR should be enhanced for parallelism?
>
>
>
> For that specific example (or other SPMD analyses and optimizations) 
> and what is implemented in clang, I am not sure whether we are going 
> to change the frontend so that it does not outline the parallel region, or allow 
> certain optimizations in clang, such as hoisting that alloca, 
> which I believe is not desired. Or we could annotate the outlined function 
> together with intrinsic_a so that hoisting can be performed, in 
> which case intrinsic_a would look like
> this:
>
>
>
> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), 
> "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4) plus info for OUTLINED_call.
>
>
>
> Mehdi,
>
>
>
> I think I am asking the same question as you asked me.
>
>
>
> Yonghong
>
>
>
>
>
>
>
> On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev 
> <llvm-dev at lists.llvm.org> wrote:
>
>
>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev 
>> <llvm-dev at lists.llvm.org> wrote:
>>
>> Sanjoy, the IR would look something like the example below. It is OK to hoist 
>> the alloca instruction outside the region. There are some small changes 
>> in the optimizer to understand the region-annotation intrinsics.
>>
>> { void main() {
>>  i32* val = alloca i32
>>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>>
>>  int i = omp_get_thread_num();
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>> With above representation, we can do privatization and outlining as 
>> below
>>
>> { void main() {
>>  i32* val = alloca i32
>>  i32* I = alloca 32
>>  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>> "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>>
>>  %i = omp_get_thread_num();
>>  compute_something_into_val(%val, %i);
>>  a[i] = %val;
>>
>>  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>> }
>>
>
> Here we come to the interesting part: hoisting “i32* I = alloca 32”
> above the intrinsic required updating the intrinsic's 
> “QUAL.PRIVATE” information.
> This means that the optimizer has to be aware of it. Am I missing the 
> magic here?
> I understand that an OpenMP-specific optimization can do it; the 
> question is how an OpenMP-agnostic pass is supposed to behave in the face of 
> llvm.experimental.intrinsic_a?
>
>> Mehdi
>
>
>
>
>> 1. Create i32* priv_val = alloca i32 and %priv_i = ... in the region, and 
>> replace all %val with %priv_val in the region.
>> 2. perform outlining.
>>
>> Caller code
>> ....
>> omp_push_num_threads(4)
>> omp_fork_call( .... outline_par_region....) ....
>>
>> Callee code:
>> Outlined_par_region {
>>   i32* priv_val = alloca i32
>>   i32* priv_i = ....
>>
>>    Ret
>> }
>>
>> For OpenMP, we support it at -O0, -O1, -O2 and -O3.  We had to 
>> make sure it runs correctly with and without optimizations and advanced 
>> analysis, so we need to preserve all source information for the BE.
>> You can take a look at our LLVM-HPC paper for some details.  There 
>> is still a lot of work to be done. Thanks.
>>
>> Xinmin
>>
>> -----Original Message-----
>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of 
>> Sanjoy Das via llvm-dev
>> Sent: Thursday, January 19, 2017 10:13 PM
>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>> Cc: llvm-dev <llvm-dev at lists.llvm.org>; 
>> llvm-dev-request at lists.llvm.org
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>
>> Hi Vikram,
>>
>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand 
>> <vadve at illinois.edu> wrote:
>>> Hi Sanjoy,
>>>
>>> Yes, that's exactly what we have been looking at recently here, but 
>>> the region tags seem to make it possible to express the control flow 
>>> as well, so I think we could start with regions+metadata, as Hal 
>>> and
>>
>> I'm not yet convinced that region tags are sufficient to model exotic 
>> control flow.
>>
>> (I don't know OpenMP so this is a copy-pasted-edited example)
>>
>> Say we have:
>>
>> void main() {
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val;
>>    compute_something_into_val(&val, i);
>>    a[i] = val;
>>  }
>> }
>>
>> I presume the (eventual) intended lowering is something like this (if 
>> the intended lowering is different than this, and avoids the issue 
>> I'm trying to highlight then my point is moot):
>>
>> void main() {
>>  tok = llvm.experimental.intrinsic_a();
>>
>>  int i = omp_get_thread_num();
>>  i32* val = alloca i32
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok);
>> }
>>
>> However, LLVM is free to hoist the alloca to the entry block:
>>
>> void main() {
>>  i32* val = alloca i32
>>  tok = llvm.experimental.intrinsic_a();
>>
>>  int i = omp_get_thread_num();
>>  compute_something_into_val(val, i);
>>  a[i] = val;
>>
>>  llvm.experimental.intrinsic_b(tok);
>> }
>>
>> and now you have a race between the four parallel forks.
>>
>> The problem here is that nothing in the IR expresses that we have 
>> four copies of the region running "at the same time".  In fact, such 
>> a control flow is alien to LLVM today.
>>
>> For instance, another evil optimization may turn:
>>
>> void main() {
>>  int a[4];
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val = compute_something_into_val(i);
>>    a[i] = val;
>>  }
>>
>>  return a[0] + a[1];
>> }
>>
>> to
>>
>> void main() {
>>  int a[4];
>>  #pragma omp parallel num_threads(4)
>>  {
>>    int i = omp_get_thread_num();
>>    int val = compute_something_into_val(i);
>>    a[i] = val;
>>  }
>>
>>  return undef;
>> }
>>
>> since a[i] = val could have initialized at most one element in a.
>>
>> Now you could say that the llvm.experimental.intrinsic_a and 
>> llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious"
>> optimizations are not allowed to happen across them; but then calls 
>> to these intrinsics are pretty fundamentally different from "normal" 
>> calls, and are probably best modeled as new instructions.
>> You're going to have to do the same kind of auditing of passes either 
>> way, and the only extra cost of a new instruction is the extra 
>> bitcode reading / writing code.
>>
>> I hope I made sense.
>>
>> -- Sanjoy
>>
>>> Xinmin proposed, and then figure out what needs to be first class 
>>> instructions.
>>
>>>
>>> --Vikram Adve
>>>
>>>
>>>
>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das 
>>>> <sanjoy at playingwithpointers.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> My bias is to use both (b) and (d), since they have complementary 
>>>> strengths.  We should use (b) for expressing concepts that can't be 
>>>> semantically modeled as a call or invoke (this branch takes both 
>>>> its successors), and (d) for expressing things that can be (this 
>>>> call may never return), and annotation like things (this region 
>>>> (denoted by def-use of a token) is a reduction).
>>>>
>>>> I don't grok OpenMP, but perhaps we can come up with one or two 
>>>> "generalized control flow"-type instructions that can be used to 
>>>> model the non-call/invoke like semantics we'd like LLVM to know 
>>>> about, and model the rest with (d)?
>>>>
>>>> -- Sanjoy
>>>>
>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev 
>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>
>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>>
>>>>>
>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini 
>>>>>> <mehdi.amini at apple.com>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org>
>>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev 
>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>
>>>>>>>
>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via 
>>>>>>>> llvm-dev <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>
>>>>>>>> Hi Johannes,
>>>>>>>>
>>>>>>>>> I am especially curious where you get your data from. Tapir 
>>>>>>>>> [0] (and to some degree PIR [1]) have shown that, 
>>>>>>>>> counterintuitively, only a few changes to LLVM passes are 
>>>>>>>>> needed. Tapir was recently used in an MIT class with a lot of 
>>>>>>>>> students and it seemed to work well with only minimal changes 
>>>>>>>>> to analysis and especially transformation passes.
>>>>>>>>
>>>>>>>> TAPIR is an elegant, small extension and, in particular, I 
>>>>>>>> think the idea of asymmetric parallel tasks and control flow is 
>>>>>>>> a clever way to express parallelism with serial semantics, as 
>>>>>>>> in Cilk.  Encoding the control flow extensions as explicit 
>>>>>>>> instructions is orthogonal to that, though arguably more 
>>>>>>>> elegant than using region tags + metadata.
>>>>>>>>
>>>>>>>> However, Cilk is a tiny language compared with the full 
>>>>>>>> complexity of other languages, like OpenMP.  To take just one 
>>>>>>>> example, TAPIR cannot express the ORDERED construct of OpenMP.  
>>>>>>>> A more serious concern, IMO, is that TAPIR (like Cilk) requires 
>>>>>>>> serial semantics, whereas there are many parallel languages, 
>>>>>>>> OpenMP included, that do not obey that restriction.
>>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or 
>>>>>>>> PRIVATE, that are needed because without that, you’d be 
>>>>>>>> dependent on fundamentally hard compiler analyses to extract 
>>>>>>>> the same information for satisfactory parallel performance; 
>>>>>>>> realistic applications cannot depend on the success of such analyses.
>>>>>>>
>>>>>>> I agree with this, but I’m also wondering if it needs to be 
>>>>>>> first class in the IR?
>>>>>>> For example we know our alias analysis is very basic, and C/C++ 
>>>>>>> have a higher constraint thanks to their type system, but we 
>>>>>>> didn’t inject this higher level information that helps the 
>>>>>>> optimizer as first class IR constructs.
>>>>>>
>>>>>>
>>>>>> FWIW, while i agree with the general point, i wouldn't use this 
>>>>>> example.
>>>>>> Because we pretty much still suffer to this day because of it 
>>>>>> (both in AA, and devirt, and ...)  :) We can't always even tell 
>>>>>> fields apart
>>>>>>
>>>>>>
>>>>>> Is it inherent to the infrastructure, i.e. using metadata instead 
>>>>>> of first class IR construct or is it just a “quality of 
>>>>>> implementation” issue?
>>>>>
>>>>> Not to derail this conversation:
>>>>>
>>>>> IMHO, At some point there is no real difference :)
>>>>>
>>>>> Because otherwise, everything is a QOI issue.
>>>>>
>>>>> IE if it's super tricky to get metadata that works well and works 
>>>>> right, doesn't get lost, etc, and that's inherent to using 
>>>>> metadata, that to me is not a QOI issue.
>>>>>
>>>>> So could it be done with metadata? Probably?
>>>>> But at the same time,  if it had been done with more first class 
>>>>> constructs, it would have happened years ago  and been much lower cost.
>>>>>
>>>>>
>>>>> This is what I meant by “inherent to the infrastructure”, thanks 
>>>>> for clarifying.
>>>>>
>>>>>
>>>>> To clarify, we were proposing metadata that is used as arguments 
>>>>> to the region-annotation intrinsics. This metadata has the nice 
>>>>> property that it does not get dropped (so it is just being used as 
>>>>> a way of encoding whatever data structures are necessary without 
>>>>> predefining a syntactic schema).
>>>>>
>>>>> -Hal
>>>>>
>>>>>
>>>>>>>>>> Mehdi
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> llvm-dev at lists.llvm.org
>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>
>>>>>
>>>>> --
>>>>> Hal Finkel
>>>>> Lead, Compiler Technology and Programming Languages Leadership 
>>>>> Computing Facility Argonne National Laboratory
>>>>>
>>>>>
>

