[PATCH] D49281: [Unroll/UnrollAndJam/Vectorizer/Distribute] Add followup loop attributes.

Fri Aug 17 16:24:08 PDT 2018

Meinersbur added inline comments.

================
Comment at: docs/LangRef.rst:5154

+'``llvm.loop.vectorize.followup_vectorized``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------
hsaito wrote:
> I understand that the RST file update should talk about what happens today, but for the sake of code review,  it's good to discuss what could happen in reasonably foreseeable future so that we don't under-design things. 
> 
> I think we should be thinking ahead about
> 1) vectorizer peeling the loop, for example, for alignment optimization. Such peeled loop could be fully unrolled if the trip count is known, or vectorized with mask.
> 2) main vector loop could be fully unrolled
> 3) there may be more than one remainder loop, e.g., vectorized remainder followed by scalar remainder.
> 4) remainder loop may be fully unrolled.
> 
> All those situations could happen w/o programmer knowing it'll happen that way.
> 
> Some of the questions we want to think before the real need arises:
>    Will the loop attribute get dropped if the "loop" is fully unrolled?
>    How do we designate more than one remainder loop?
>    Will the loop attribute applicable for vectorized peel/remainder?
>    Should we have a way to designate runtime-DD non-vectorizable loop separately from remainder?
> 
  # I think a prologue/peel is analogous to epilogue/remainder. That is, a new `llvm.loop.vectorize.followup_peel` can be added.

  # Should be handled as two separate transformations (such as vectorize/interleave). That is `llvm.loop.disable_nonforced` would ensure that a loop does not unexpectedly disappear

  #  `llvm.loop.followup_remainder` should apply on any of the remainder loops. If a finer distinction is required, we can add more specific attributes.

  # This can already happen, at least with LoopUnroll/LoopUnrollandJam. The docs mentions that in this case the `followup_remainder` is dropped.

However, changing the model a transformation transform to can indeed raise some backward-compatibility issues. This also applies to the user-interface. If a programmer added 
```
#pragma clang loop vectorize(enable)
```
do they expect it to be unrolled as well? Loop peeling? D50480 is interesting here: At `-Os`, it uses masking instead of an epilogue to avoid a code copy. In this case `followup_remainder` explicitly states that there's not necessarily a remainder loop, so I don't see a problem here. But a programmer might expect more control over what the output structure is.

We can add more attributes to control this behaviour, such as `llvm.loop.vectorize.peel.enable`, `llvm.loop.vectorize.remainder.enable`, `llvm.loop.vectorize.allow_versioning`. The interesting question is, what is the default setting? 
If we go by the current behavior to maximize backwards-compatibility, remainder and versioning would be enabled by default (if not in `-Os`), peeling disabled because it is not yet implemented.

On the other side we probably do not want frontends to emit the most recent enable-metadata to get the best vectorization. So we would enable all features by default, but the output loops might be different from what the programmer intended before the feature is introduced.

We can enable all features unless the transformation is forced, in which case all deviations from the current transformation model needs to be explicitly enabled.

IMHO, we can decide this case-by-case, weighting compatibility concerns and optimization levels. Then again, such transformations does not influence the correctness of the output.

To be less concerned about compatibility issues, I could for now remove all followup-attributes except those that are 'central' to the transformation, and `followup_all`. For vectorization, there will always be the performance-critical vectorized loop (i.e. `followup_vectorized`), independent of whether there is a prologue, epilogue or fallback. For partial unrolling, it is always a unrolled loop.

  - Will the loop attribute get dropped if the "loop" is fully unrolled?

Yes. But it should not happen if `llvm.loop.disable_nonforced` is used and the unroll is not explicitly specified.

  - How do we designate more than one remainder loop?

Using different attributes. Like `followup_all` it is possibly to address a group of loops.

  - Will the loop attribute applicable for vectorized peel/remainder?

Only for the followup that addresses them

  - Should we have a way to designate runtime-DD non-vectorizable loop separately from remainder?

As mentioned sometime before, the typical reaction to 'loop not vectorized' is not 'ok, let's unroll it instead', but 'how can I make it vectorize'. So I don't think fallbacks are necessary (unless we can apply a sequence of transformation to multiple loop), but I am open if you think there is a need for such.

================
Comment at: docs/LangRef.rst:5251
+
+'``llvm.loop.unroll.followup_remainder``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------
hsaito wrote:
> Remainder here may be unrolled again or fully unrolled (see the comments on vectorize metadata). What do we do for that?
> 
`followup_remainder` is ignored. If this it is not clear from the section in `TransformMetadata.rst`, please tell me.

================
Comment at: docs/LangRef.rst:5265-5266
 disable unroll and jam (so ``llvm.loop.unroll`` metadata will be left to the
 unroller, plus ``llvm.loop.unroll.disable`` metadata will disable unroll and jam
 too.)

----------------
@dmgreen  This directly contradicts the `nounroll_plus_unroll_and_jam` test case in `Transforms/LoopUnrollAndJam/pragma.ll`

================
Comment at: docs/LangRef.rst:5311

+'``llvm.loop.unroll_and_jam.followup_outer``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------
hsaito wrote:
> Is there an assumption of unroll_and_jam operating only on a double loop and/or a perfect loop? Technically speaking, we can unroll_and_jam a loop if we can legally outerloop-vectorize. So, there can be multiple inner loops. 
There's still an outermost (unrolled) loop and an innermost (jammed) loop. We could also adds followups for middle loops.

If it is the naming that concerns you: Would you prefer `followup_unrolled` and `followup_jammed`?

================
Comment at: docs/LangRef.rst:5380

+'``llvm.loop.distribute.followup_noncyclic``' Metadata
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
----------------
hsaito wrote:
> Looks rather centric to distribute-for-vectorization.
> 
> Loop distribution can happen for many reasons (and it may be more than one reasons). Are we going to define followup_ Metadata for each of those reasons? What'll happen if a loop matches the characteristics of more than one Metadata?
There is no overlap between `cyclic` and `noncyclic`. For the [[ https://arxiv.org/abs/1805.03374| extended loop-transformations ]], the user would name the loops they want distributed.

Indeed, these followup are are specific to the current distribution pass. However, I think it is easy for any distribution to determine whether a loop has cyclic dependences and and those attributes to any output loop that matches.

`makeFollowupLoopID` can already combine attributes from multiple followups.

================
Comment at: lib/Transforms/Scalar/LoopUnrollAndJamPass.cpp:194

+  TransformationMode EnableMode = hasUnrollAndJamTransformation(L);
+  if (EnableMode & TM_Disable) {
----------------
Meinersbur wrote:
> dmgreen wrote:
> > This code will need rebasing. There is a check earlier that looks for disable metadata that could be replaced by this. Look for HasUnrollDisablePragma/HasUnrollAndJamDisablePragma. If the same was done for unrolling, I think that would remove the need for the IgnoreUser (although your comment about it is probably still true).
> I'd push towards refectoring-out the common parts of `computeUnrollCount` used by LoopUnroll and LoopUnrollAndJam. Currently `computeUnrollCount` uses lots of settings meant for LoopUnroll (`llvm.loop.unroll.` metadata which should not exist anymore, OptimizationRemarkMissed specific to LoopUnroll, `-unroll-count`, `PartialThreshold`, handling of full unroll, loop peeling that UnrollAndJam does not support, being used in a single call by UnrollAndJam for two different things: determining `ExplicitUnroll` (i.e. is normal unroll is forced) and the unroll-and-jam count). It's hard to understand the subtleties between those codes.
> 
> I gave up at some point and added the `IgnoreUser` flag to make test cases pass.
`IgnoreUser` exists for the `nounroll_plus_unroll_and_jam` in `LoopUnrollAndJam\pragma.ll`.

`llvm.loop.unroll.disable causes`hasUnrollTransformation` in `computeUnrollCount` to return `TM_Disable`. Unrolling inside `computeUnrollCount` is disabled setting the unroll factor to 0. UnrollAndJam then tries to use the that unroll factor.

Repository:
  rL LLVM

https://reviews.llvm.org/D49281