[llvm-dev] [GSoC 2016] Implementation of the packing transformation
Roman Gareev via llvm-dev
llvm-dev at lists.llvm.org
Wed Jul 6 00:22:46 PDT 2016
Thank you for the explanation and ideas!
>> >> However, the modification would probably cause issues. For example, if
>> >> we apply tiling, we'll have to substitute old induction variables
>> >> before analyzing access functions.
>> >
>> > I do not understand this issue. Tiling is a schedule-only transformation
>> > that only considers data-dependences for legality. How do access
>> > functions come into play here? Why do induction variables need to be
>> > substituted?
>> >
>> > If you did not see this yet, any modification of the polyhedral access
>> > function will (except for non-affine accesses) directly be
>> > translated to LLVM-IR. This means we need (almost) no code generation
>> > changes to make this work.
>>
>> Yes, I agree, tiling is a schedule-only transformation that only
>> considers data-dependences for legality, and any modification of the
>> polyhedral access function is directly translated to LLVM-IR.
>>
>> I just wanted to say that if a schedule tree is modified in
>> ScheduleOptimizer and new induction variables are created (e.g. in
>> the case of tiling), the access relations and new access relations may
>> stay unchanged. This should probably be taken into account if we want
>> to modify an access relation according to its previous state.
>>
>> For example, in the case of matrix multiplication, we get the following isl
>> ast
>>
>> if (1 && (&MemRef_7[1023][1056] <= &MemRef_5[0][0] ||
>> &MemRef_5[1055][1056] <= &MemRef_7[0][0]) && (&MemRef_6[1055][1024] <=
>> &MemRef_5[0][0] || &MemRef_5[1055][1056] <= &MemRef_6[0][0]))
>>
>> {
>> for (int c0 = 0; c0 <= 1055; c0 += 1)
>> for (int c1 = 0; c1 <= 1055; c1 += 1)
>> Stmt_12(c0, c1);
>> for (int c0 = 0; c0 <= 1055; c0 += 1)
>> for (int c1 = 0; c1 <= 1055; c1 += 1)
>> for (int c2 = 0; c2 <= 1023; c2 += 1)
>> Stmt_17(c0, c1, c2);
>> }
>>
>> else
>> { /* original code */ }
>>
>> and the following memory access is generated (the corresponding new
>> memory access is NULL):
>>
>> { Stmt_17[i0, i1, i2] -> MemRef_6[i0, i2] }
>>
>> If we try to create BLIS micro- and macro-kernels, we get the following
>> isl ast
>>
>> {
>> // 1st level tiling - Tiles
>> for (int c0 = 0; c0 <= 32; c0 += 1)
>> for (int c1 = 0; c1 <= 32; c1 += 1) {
>> // 1st level tiling - Points
>> for (int c2 = 0; c2 <= 31; c2 += 1)
>> for (int c3 = 0; c3 <= 31; c3 += 1)
>> Stmt_12(32 * c0 + c2, 32 * c1 + c3);
>> }
>> // 2nd level tiling - Tiles
>> for (int c0 = 0; c0 <= 65; c0 += 1)
>> for (int c1 = 0; c1 <= 3; c1 += 1)
>> for (int c2 = 0; c2 <= 10; c2 += 1) {
>> // 2nd level tiling - Points
>> // Register tiling - Tiles
>> for (int c3 = 0; c3 <= 3; c3 += 1)
>> for (int c4 = 0; c4 <= 11; c4 += 1)
>> for (int c5 = 0; c5 <= 255; c5 += 1) {
>> // Register tiling - Points
>> // 1st level tiling - Tiles
>> // 1st level tiling - Points
>> {
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 1, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 2, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 3, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 5, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 6, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3, 96 * c2 + 8 * c4 + 7, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 1, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 2, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 3, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 5, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 6, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 1, 96 * c2 + 8 * c4 + 7, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 1, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 2, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 3, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 5, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 6, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 2, 96 * c2 + 8 * c4 + 7, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 1, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 2, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 3, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 4, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 5, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 6, 256 * c1 + c5);
>> Stmt_17(16 * c0 + 4 * c3 + 3, 96 * c2 + 8 * c4 + 7, 256 * c1 + c5);
>> }
>> }
>> }
>> }
>>
>> else
>> { /* original code */ }
>>
>> However, if we dump the memory access and the new memory access after
>> the creation of the BLIS micro-kernel (e.g. after the call to
>> ScheduleTreeOptimizer::optimizeMatMulPattern), we'll see that they
>> remain the same.
>>
>> In this case, we should probably substitute the old induction
>> variables if we want to determine that the memory access actually has
>> the following form
>>
>> Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_6[16 * c0 + 4 * c3 + c6, 96 * c2 + 8 * c4, 256 * c1 + c5]
>>
>> and replace it with, for example, the following memory access
>>
>> Stmt_17[c0, c1, c2, c3, c4, c5, c6, c7] -> MemRef_x[4 * (c5 + 256 * c3) + c6]
>
> Memory accesses are always described as functions of the original
> iteration space dimensions. When code-generating the original (or
> modified) memory accesses, the original iteration space dimensions are
> automatically translated to the induction variables introduced by the
> new schedule.
>
> This means that if you want to perform a data-layout transformation,
> there are two ways. Either you define a relation (isl_map) that maps
> old array locations to new array locations and apply it to the
> existing access functions, or you derive from scratch a new isl_map
> that relates statement instances in the original iteration space to
> memory locations in the new array.
>
> What you probably have more easily available is the mapping between
> statement instances scheduled in the already tiled loop nest and the
> memory locations they access. Such a mapping needs to be translated into
> one of the earlier representations.
I'll try to implement it soon.
>> >> 2. What should we do to change an array referenced by a memory access?
>> >>
>> >> If I'm not mistaken, we can set the value of MemoryAccess::BasePtr to do
>> >> it.
>> >
>> > To just make it work, it should be sufficient to set an access function
>> > that has an isl_id that points to a different ScopArrayInfo object.
>> >
>> > See: IslExprBuilder::createAccessAddress
>> >
>> > BaseExpr = isl_ast_expr_get_op_arg(Expr, 0);
>> > BaseId = isl_ast_expr_get_id(BaseExpr);
>> > isl_ast_expr_free(BaseExpr);
>> >
>> > const ScopArrayInfo *SAI = ScopArrayInfo::getFromId(BaseId);
>> > Base = SAI->getBasePtr();
>> >
> Now, just resetting the access function leaves the MemoryAccess in a
> rather inconsistent state. It would be good if we could think about
> what is needed to provide a consistent interface for this kind of
> functionality.
>>
>> What do you mean by an inconsistent state? Is it a wrong value of, for
>> example, the following fields: MemoryAccess::BaseName,
>> MemoryAccess::Sizes, MemoryAccess::ElementType?
>
> Right. getScopArrayInfo() might still return the original
> ScopArrayInfo, whereas the access relation already refers to a
> different array. For a first prototype, it is probably not necessary to
> fix all of this, but before upstreaming any of it, we probably want to
> go over the interface and see if we can improve the situation.
I think it would be good.
>> >> 3. Could we copy data to a new array in IslNodeBuilder::createMark?
>> >>
>> >> We can probably create mark nodes, which contain references to memory
>> >> accesses that should be modified. Subsequently, using information
>> >> about original and new access relations, IslNodeBuilder::createMark
>> >> would generate functions that perform such copying.
>> >
>> > Why mark nodes?
>>
>> If I'm not mistaken, the position of a mark node in a schedule tree
>> determines the place in the isl ast where it should be generated. I
>> think that makes it convenient if we want to specify where to perform
>> the copying.
>
> I understand that you want to mark a location in the schedule tree. One
> option is mark nodes; the other is extension nodes (see the isl
> documentation).
>
> Please read through the documentation of extension nodes. I believe they
> are better suited for what you are looking for, as they behave like any
> other user statement.
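For reference, a rough sketch of what a schedule tree with an extension node could look like in isl's YAML notation (the sets, names, and bounds here are illustrative only, not taken from the actual implementation):

```yaml
# An extension node introduces extra statement instances (here a copy
# statement) in terms of the outer schedule dimensions; the sequence then
# orders the copy before the computation, like any other user statement.
domain: "{ Stmt_17[i, j, k] : 0 <= i, j < 1056 and 0 <= k < 1024 }"
child:
  schedule: "[{ Stmt_17[i, j, k] -> [i] }]"
  child:
    extension: "{ [i] -> CopyStmt[i] }"
    child:
      sequence:
      - filter: "{ CopyStmt[i] }"
      - filter: "{ Stmt_17[i, j, k] }"
```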
>
>> > An option I have been thinking of was to create new -
>> > virtual ScopStmts that have specific semantics. E.g. ones that copy the
>> > content from one array location to another. We could test these using
>> > jscop and then use them in your schedule transformation to implement the
>> > packing.
>>
>> Yes, maybe it's a better option. What are 'new virtual ScopStmts'? Is
>> it a new class that we should add?
>
> Currently we have normal ScopStmts and RegionStmts. We may now also
> introduce CopyStmts, which implement the mem-to-mem copying.
>
> How to model this best is something we (you ;) ) need to think about.
> One option might indeed be to use classes to model these different
> kinds of statements.
OK. I'll try to find the best option.
>> Would its objects be created
>> during the work of ScheduleOptimizer?
>
> Yes, this could make sense.
--
Cheers, Roman Gareev.