[llvm-dev] [RFC] IR-level Region Annotations

Sat Jan 21 14:04:49 PST 2017

I missed this before answering my last email about how "as is” it does not sound like something that can scale non-intrusively in LLVM as a model-agnostic and general-purpose region semantic. 
Sanjoy explains it just better :), so right now my understanding is the same as what Sanjoy expressed below.

— 
Mehdi

> On Jan 20, 2017, at 11:35 AM, Sanjoy Das <sanjoy at playingwithpointers.com> wrote:
> 
> Hi,
> 
> I'm going to club together some responses.
> 
> I agree that outlining function sub-bodies and passing in the function
> pointers to said outlined bodies to OpenMP helpers lets us correctly
> implement the semantics we need.  However, unless I severely
> misunderstood the thread, I thought the key idea was to move *away*
> from that representation and towards a representation that _allows_
> optimization?
> 
> My problem with representing parallel regions with
> intrinsic-denoted-regions is that we're lying to the optimizer about
> what the code actually does.  Calls, even to intrinsics, can "at
> worst" do some combination of the following:
> 
> - Write to and read from arbitrary memory
> - Have UB (but we're allowed to pretend that they don't)
> - Throw an exception
> - Never return, either by infinite looping or by calling exit(0)
> - Have memory synchronization operations, like fences, atomic loads,
>   stores etc.
> - Have side effects like IO, volatile writes
> 
> If an intrinsic's behavior can be explained by some subset of the
> above, then you should not need to edit any pass to preserve
> _correctness_ -- all optimization passes (today) conservatively assume
> that calls that they don't understand have all of the behaviors
> outlined above.
> 
> However, if to preserve *correctness* you have to edit optimization
> passes and teach them that certain intrinsic calls have behavior
> *outside* the set mentioned above then the instruction really does not
> have "call semantics".  `call @llvm.experimental.region_begin()` is
> really a fundamentally new instruction masquerading as an intrinsic,
> and it is probably better to call a spade a spade and represent it as
> a new instruction.
> 
> The setting for the examples I gave was not that "here is a case we
> need to get right".  The setting was that "here is a *symptom* that
> shows that we've lied to the optimizer".  We can go ahead and fix all
> the symptoms by adding bailouts to the respective passes, but that
> does not make us immune to passes that we don't know about
> e.g. downstream passes, passes that will be added later.  It also puts
> us in a weird spot around semantics of call instructions.
> 
> -- Sanjoy
> 
> On Fri, Jan 20, 2017 at 11:22 AM, Tian, Xinmin via llvm-dev
> <llvm-dev at lists.llvm.org> wrote:
>> Yonghong, In our implementation (not open sourced), we don’t do outlining
>> the Front-End.  See my previous reply to Medhi’s email.
>> 
>> 
>> 
>> Xinmin
>> 
>> 
>> 
>> From: Yonghong Yan [mailto:yan at oakland.edu]
>> Sent: Friday, January 20, 2017 11:18 AM
>> To: Mehdi Amini
>> Cc: Tian, Xinmin; llvm-dev at lists.llvm.org; llvm-dev-request at lists.llvm.org;
>> Adve, Vikram Sadanand
>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>> 
>> 
>> 
>> Xinmin,
>> 
>> 
>> 
>> outlining turns a parallel program into a sequential one from compiler's
>> perspective, and that is why most of the parallel-ignorant pass would hurt.
>> In your IR description for Sanjoy's example, does the current approach of
>> outlining impacting the way of the IR should be enhanced for parallelism?
>> 
>> 
>> 
>> For that specific example (or other analysis and optimization SPMD) and what
>> is implemented in clang, I am not sure whether we are going to change the
>> frontend so not to outline the parallel region, or allow to perform certain
>> optimization such as hoisting that alloca in clang which is not desired I
>> believe. Or annotate the outlined function together with the intrinsic_a so
>> that hoisting can be performed, in which case the instrisic_a would like
>> this:
>> 
>> 
>> 
>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32*
>> %val, i32 %i), "QUAL.NUM_THREADS"(i32 4) plus info for OUTLINED_call.
>> 
>> 
>> 
>> Mehdi,
>> 
>> 
>> 
>> I think i am asking the same question as you asked me.
>> 
>> 
>> 
>> Yonghong
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> On Fri, Jan 20, 2017 at 1:54 PM, Mehdi Amini via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> 
>> 
>>> On Jan 20, 2017, at 10:44 AM, Tian, Xinmin via llvm-dev
>>> <llvm-dev at lists.llvm.org> wrote:
>>> 
>>> Sanjoy, the IR would be like something below. It is ok to hoist alloca
>>> instruction outside the region. There are some small changes in optimizer to
>>> understand region-annotation intrinsic.
>>> 
>>> { void main() {
>>> i32* val = alloca i32
>>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>> "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]
>>> 
>>> int i = omp_get_thread_num();
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>> 
>>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>> }
>>> 
>>> With above representation, we can do privatization and outlining as below
>>> 
>>> { void main() {
>>> i32* val = alloca i32
>>> i32* I = alloca 32
>>> tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
>>> "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]
>>> 
>>> %ii = omp_get_thread_num();
>>> compute_something_into_val(%val, %i);
>>> a[i] = %val;
>>> 
>>> llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
>>> }
>>> 
>> 
>> Here we come to the interesting part: the hoisting of  "i32* I = alloca 32”
>> above the intrinsics required to update the intrinsics information
>> “QUAL.PRIVATE”.
>> This means that the optimizer has to be aware of it, I’m missing the magic
>> here?
>> I understand that an openmp specific optimization can do it, the question is
>> how it an openmp agnostic supposed to behave in face of
>> llvm.experimental.intrinsic_a?
>> 
>> —
>> Mehdi
>> 
>> 
>> 
>> 
>>> 1. create i32* priv_val = alloca i32  %priv_i = ...in the region, and
>>> replace all  %val with %prv_val in the region.
>>> 2. perform outlining.
>>> 
>>> Caller code
>>> ....
>>> omp_push_num_threads(4)
>>> omp_fork_call( .... outline_par_region....) ....
>>> 
>>> Callee code:
>>> Outlined_par_rgion {
>>>  I32* priv_val = alloca 32
>>>  I32* priv_i = ....
>>> 
>>>   Ret
>>> }
>>> 
>>> For OpenMP, we do support it at -O0, -O1, -O2 and -O3.  We had to make
>>> sure it runs correctly w/ and w/o optimizations and advanced analysis. So we
>>> need to preserve all source information for BE.
>>> You can take a look our LLVM-HPC paper for a bit some details.  There are
>>> still tons of work to be done. Thanks.
>>> 
>>> Xinmin
>>> 
>>> -----Original Message-----
>>> From: llvm-dev [mailto:llvm-dev-bounces at lists.llvm.org] On Behalf Of
>>> Sanjoy Das via llvm-dev
>>> Sent: Thursday, January 19, 2017 10:13 PM
>>> To: Adve, Vikram Sadanand <vadve at illinois.edu>
>>> Cc: llvm-dev <llvm-dev at lists.llvm.org>; llvm-dev-request at lists.llvm.org
>>> Subject: Re: [llvm-dev] [RFC] IR-level Region Annotations
>>> 
>>> Hi Vikram,
>>> 
>>> On Thu, Jan 19, 2017 at 9:27 PM, Adve, Vikram Sadanand
>>> <vadve at illinois.edu> wrote:
>>>> Hi Sanjoy,
>>>> 
>>>> Yes, that's exactly what we have been looking at recently here, but
>>>> the region tags seem to make it possible to express the control flow
>>>> as well, so I think we could start with reg ions+metadata, as Hal and
>>> 
>>> I'm not yet convinced that region tags are sufficient to model exotic
>>> control flow.
>>> 
>>> (I don't know OpenMP so this is a copy-pasted-edited example)
>>> 
>>> Say we have:
>>> 
>>> void main() {
>>> #pragma omp parallel num_threads(4)
>>> {
>>>   int i = omp_get_thread_num();
>>>   int val;
>>>   compute_something_into_val(&val, i);
>>>   a[i] = val;
>>> }
>>> }
>>> 
>>> I presume the (eventual) intended lowering is something like this (if the
>>> intended lowering is different than this, and avoids the issue I'm trying to
>>> highlight then my point is moot):
>>> 
>>> void main() {
>>> tok = llvm.experimental.intrinsic_a();
>>> 
>>> int i = omp_get_thread_num();
>>> i32* val = alloca i32
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>> 
>>> llvm.experimental.intrinsic_b(tok);
>>> }
>>> 
>>> However, LLVM is free to hoist the alloca to the entry block:
>>> 
>>> void main() {
>>> i32* val = alloca i32
>>> tok = llvm.experimental.intrinsic_a();
>>> 
>>> int i = omp_get_thread_num();
>>> compute_something_into_val(val, i);
>>> a[i] = val;
>>> 
>>> llvm.experimental.intrinsic_b(tok);
>>> }
>>> 
>>> and now you have a race between the four parallel forks.
>>> 
>>> The problem here is that nothing in the IR expresses that we have four
>>> copies of the region running "at the same time".  In fact, such a control
>>> flow is alien to LLVM today.
>>> 
>>> For instance, another evil optimization may turn:
>>> 
>>> void main() {
>>> int a[4];
>>> #pragma omp parallel num_threads(4)
>>> {
>>>   int i = omp_get_thread_num();
>>>   int val = compute_something_into_val(i);
>>>   a[i] = val;
>>> }
>>> 
>>> return a[0] + a[1];
>>> }
>>> 
>>> to
>>> 
>>> void main() {
>>> int a[4];
>>> #pragma omp parallel num_threads(4)
>>> {
>>>   int i = omp_get_thread_num();
>>>   int val = compute_something_into_val(i);
>>>   a[i] = val;
>>> }
>>> 
>>> return undef;
>>> }
>>> 
>>> since a[i] = val could have initialized at most one element in a.
>>> 
>>> Now you could say that the llvm.experimental.intrinsic_a and
>>> llvm.experimental.intrinsic_b intrinsics are magic, and even such "obvious"
>>> optimizations are not allowed to happen across them; but then calls to these
>>> intrinsics is pretty fundamentally different from "normal" calls, and are
>>> probably best modeled as new instructions.
>>> You're going to have to do the same kind of auditing of passes either way,
>>> and the only extra cost of a new instruction is the extra bitcode reading /
>>> writing code.
>>> 
>>> I hope I made sense.
>>> 
>>> -- Sanjoy
>>> 
>>>> Xinmin proposed, and then figure out what needs to be first class
>>>> instructions.
>>> 
>>>> 
>>>> --Vikram Adve
>>>> 
>>>> 
>>>> 
>>>>> On Jan 19, 2017, at 11:03 PM, Sanjoy Das
>>>>> <sanjoy at playingwithpointers.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> My bias is to use both (b) and (d), since they have complementary
>>>>> strengths.  We should use (b) for expressing concepts that can't be
>>>>> semantically modeled as a call or invoke (this branch takes both its
>>>>> successors), and (d) for expressing things that can be (this call may
>>>>> never return), and annotation like things (this region (denoted by
>>>>> def-use of a token) is a reduction).
>>>>> 
>>>>> I don't grok OpenMP, but perhaps we can come with one or two
>>>>> "generalized control flow"-type instructions that can be used to
>>>>> model the non-call/invoke like semantics we'd like LLVM to know
>>>>> about, and model the rest with (d)?
>>>>> 
>>>>> -- Sanjoy
>>>>> 
>>>>> On Thu, Jan 19, 2017 at 8:28 PM, Hal Finkel via llvm-dev
>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>> 
>>>>>> On 01/19/2017 03:36 PM, Mehdi Amini via llvm-dev wrote:
>>>>>> 
>>>>>> 
>>>>>> On Jan 19, 2017, at 1:32 PM, Daniel Berlin <dberlin at dberlin.org> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On Thu, Jan 19, 2017 at 1:12 PM, Mehdi Amini <mehdi.amini at apple.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> On Jan 19, 2017, at 12:04 PM, Daniel Berlin <dberlin at dberlin.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Jan 19, 2017 at 11:46 AM, Mehdi Amini via llvm-dev
>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jan 19, 2017, at 11:36 AM, Adve, Vikram Sadanand via llvm-dev
>>>>>>>>> <llvm-dev at lists.llvm.org> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Johannes,
>>>>>>>>> 
>>>>>>>>>> I am especially curious where you get your data from. Tapir [0]
>>>>>>>>>> (and to some degree PIR [1]) have shown that,
>>>>>>>>>> counterintuitively, only a few changes to LLVM passes are
>>>>>>>>>> needed. Tapir was recently used in an MIT class with a lot of
>>>>>>>>>> students and it seemed to work well with only minimal changes to
>>>>>>>>>> analysis and especially transformation passes.
>>>>>>>>> 
>>>>>>>>> TAPIR is an elegant, small extension and, in particular, I think
>>>>>>>>> the idea of asymmetric parallel tasks and control flow is a
>>>>>>>>> clever way to express parallelism with serial semantics, as in
>>>>>>>>> Cilk.  Encoding the control flow extensions as explicit
>>>>>>>>> instructions is orthogonal to that, though arguably more elegant
>>>>>>>>> than using region tags + metadata.
>>>>>>>>> 
>>>>>>>>> However, Cilk is a tiny language compared with the full
>>>>>>>>> complexity of other languages, like OpenMP.  To take just one
>>>>>>>>> example, TAPIR cannot express the ORDERED construct of OpenMP.  A
>>>>>>>>> more serious concern, IMO, is that TAPIR (like Cilk) requires
>>>>>>>>> serial semantics, whereas there are many parallel languages, OpenMP
>>>>>>>>> included, that do not obey that restriction.
>>>>>>>>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
>>>>>>>>> that are needed because without that, you’d be dependent on
>>>>>>>>> fundamentally hard compiler analyses to extract the same
>>>>>>>>> information for satisfactory parallel performance; realistic
>>>>>>>>> applications cannot depend on the success of such analyses.
>>>>>>>> 
>>>>>>>> I agree with this, but I’m also wondering if it needs to be first
>>>>>>>> class in the IR?
>>>>>>>> For example we know our alias analysis is very basic, and C/C++
>>>>>>>> have a higher constraint thanks to their type system, but we
>>>>>>>> didn’t inject this higher level information that helps the
>>>>>>>> optimizer as first class IR constructs.
>>>>>>> 
>>>>>>> 
>>>>>>> FWIW, while i agree with the general point, i wouldn't use this
>>>>>>> example.
>>>>>>> Because we pretty much still suffer to this day because of it (both
>>>>>>> in AA, and devirt, and ...)  :) We can't always even tell fields
>>>>>>> apart
>>>>>>> 
>>>>>>> 
>>>>>>> Is it inherent to the infrastructure, i.e. using metadata instead
>>>>>>> of first class IR construct or is it just a “quality of
>>>>>>> implementation” issue?
>>>>>> 
>>>>>> Not to derail this conversation:
>>>>>> 
>>>>>> IMHO, At some point there is no real difference :)
>>>>>> 
>>>>>> Because otherwise, everything is a QOI issue.
>>>>>> 
>>>>>> IE if it's super tricky to get metadata that works well and works
>>>>>> right, doesn't get lost, etc, and that's inherent to using metadata,
>>>>>> that to me is not a QOI issue.
>>>>>> 
>>>>>> So could it be done with metadata? Probably?
>>>>>> But at the same time,  if it had been done with more first class
>>>>>> constructs, it would have happened years ago  and been much lower cost.
>>>>>> 
>>>>>> 
>>>>>> This is what I meant by “inherent to the infrastructure”, thanks for
>>>>>> clarifying.
>>>>>> 
>>>>>> 
>>>>>> To clarify, we were proposing metadata that is used as arguments to
>>>>>> the region-annotation intrinsics. This metadata has the nice
>>>>>> property that it does not get dropped (so it is just being used as a
>>>>>> way of encoding whatever data structures are necessary without
>>>>>> predefining a syntactic schema).
>>>>>> 
>>>>>> -Hal
>>>>>> 
>>>>>> 
>>>>>> —
>>>>>> Mehdi
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Hal Finkel
>>>>>> Lead, Compiler Technology and Programming Languages Leadership
>>>>>> Computing Facility Argonne National Laboratory
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> LLVM Developers mailing list
>>>>>> llvm-dev at lists.llvm.org
>>>>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>>>>> 
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> 
>> 
>> 
>> 
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>