[llvm-dev] Musings on the TableGen -emit-dag-isel backend
Craig Topper via llvm-dev
llvm-dev at lists.llvm.org
Fri Nov 13 11:08:52 PST 2020
Can we skip the relaxation step if comments aren't being emitted? Comment
emission is off by default in the CMake configuration.
~Craig
On Fri, Nov 13, 2020 at 10:59 AM Owen Anderson via llvm-dev <
llvm-dev at lists.llvm.org> wrote:
> This is the size of the table, not the size of the overall binary, right?
> I would imagine that a 4% growth in the size of the table is a
> substantially smaller growth in the total executable size of, say, clang.
> If the overall growth is minuscule (say, under 1%), then this seems like
> the clear path forward.
>
> I’m also optimistic that we might be able to find other ways to shrink the
> tables to make up the difference if we need to.
>
> —Owen
>
> On Nov 13, 2020, at 2:45 AM, Jay Foad via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
> I wouldn't want to be too hasty about simply removing the relaxation
> algorithm. The size and speed of the compiler affects all users, but the
> time to compile the compiler "only" affects us compiler developers. And I
> speak as a developer who is heavily affected by the time to compile the
> AMDGPU backend.
>
> One off-the-cuff idea (I haven't even looked at the code yet): could we
> pass in a null output stream while doing the relaxation phase, and use that
> to skip most of the actual textual formatting, followed by a final pass
> with a real output stream once the offsets have been determined?
>
> Jay.
>
> On Thu, 12 Nov 2020 at 16:43, Krzysztof Parzyszek via llvm-dev <
> llvm-dev at lists.llvm.org> wrote:
>
>> This is great! Thanks Paul!
>>
>> I think that the 9x reduction in compile-time is well worth the 4% size
>> increase. TableGen's run-time has been a sore point and a source of
>> complaints for quite some time.
>>
>> --
>> Krzysztof Parzyszek kparzysz at quicinc.com AI tools development
>>
>> -----Original Message-----
>> From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of Paul C.
>> Anagnostopoulos via llvm-dev
>> Sent: Thursday, November 12, 2020 10:23 AM
>> To: llvm-dev at lists.llvm.org
>> Subject: [EXT] [llvm-dev] Musings on the TableGen -emit-dag-isel backend
>>
>> A rather notorious aspect of TableGen is the time required to run the
>> -emit-dag-isel backend on some targets, including AMDGPU and X86. I added a
>> timing feature to TableGen and timed the AMDGPU run.
>>
>>
>> ===-------------------------------------------------------------------------===
>> TableGen Phase Timing
>> ===-------------------------------------------------------------------------===
>> Total Execution Time: 733.6103 seconds (733.8740 wall clock)
>>
>> ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>> 645.0017 ( 87.9%)  0.2340 (100.0%)  645.2357 ( 88.0%)  645.2709 ( 87.9%)  Emit matcher table
>>  70.4501 (  9.6%)  0.0000 (  0.0%)   70.4501 (  9.6%)   70.5510 (  9.6%)  Convert to matchers
>>  14.6329 (  2.0%)  0.0000 (  0.0%)   14.6329 (  2.0%)   14.7638 (  2.0%)  Parse, build records
>>   2.1996 (  0.3%)  0.0000 (  0.0%)    2.1996 (  0.3%)    2.1871 (  0.3%)  Sort patterns
>>   1.0920 (  0.1%)  0.0000 (  0.0%)    1.0920 (  0.1%)    1.0961 (  0.1%)  Optimize matchers
>>   0.0000 (  0.0%)  0.0000 (  0.0%)    0.0000 (  0.0%)    0.0050 (  0.0%)  Write output
>> 733.3763 (100.0%)  0.2340 (100.0%)  733.6103 (100.0%)  733.8740 (100.0%)  Total
>>
>> As you can see, most of the time is spent emitting the C++ code. A simple
>> step toward reducing the time is to use the --omit-comments option.
>> However, I am informed that trying to debug the pattern matching table
>> without comments is a hopeless task.
>>
>>
>> ===-------------------------------------------------------------------------===
>> TableGen Phase Timing
>> ===-------------------------------------------------------------------------===
>> Total Execution Time: 162.9274 seconds (162.9173 wall clock)
>>
>> ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>>  75.0833 ( 46.1%)  0.0468 ( 42.9%)   75.1301 ( 46.1%)   75.1313 ( 46.1%)  Emit matcher table
>>  69.7948 ( 42.9%)  0.0000 (  0.0%)   69.7948 ( 42.8%)   69.8050 ( 42.8%)  Convert to matchers
>>  14.6173 (  9.0%)  0.0468 ( 42.9%)   14.6641 (  9.0%)   14.6668 (  9.0%)  Parse, build records
>>   2.2308 (  1.4%)  0.0000 (  0.0%)    2.2308 (  1.4%)    2.2191 (  1.4%)  Sort patterns
>>   1.0920 (  0.7%)  0.0000 (  0.0%)    1.0920 (  0.7%)    1.0921 (  0.7%)  Optimize matchers
>>   0.0000 (  0.0%)  0.0156 ( 14.3%)    0.0156 (  0.0%)    0.0030 (  0.0%)  Write output
>> 162.8182 (100.0%)  0.1092 (100.0%)  162.9274 (100.0%)  162.9173 (100.0%)  Total
>>
>> Emitting the C++ code for most pattern operators is straightforward.
>> However, three operators are more time-consuming: Matcher::Scope,
>> SwitchOpcode, and SwitchType. These operators take a list of child
>> patterns, each of which is emitted as a 1- to 3-byte size followed by the
>> child's pattern bytes. The size is coded as a variable-length sequence of
>> bytes, as is every integer in the matcher table. (Just for interest, the
>> average number of children per Scope operator is about 2.7.)
>>
>> In order to minimize the length of the size, the backend performs a sort
>> of relaxation algorithm, where it first tries a 1-byte size. If that fails,
>> it tries the actual required number of bytes. Trying involves calculating
>> the offset in the table for the child and then recursively generating the
>> entire child, passing it a string buffer as its output stream. When the
>> size is finally determined, that string buffer is appended to the actual
>> buffer passed to the generation function.
>>
>> Why does the number of bytes in the size matter to the child? Because the
>> offset of the child is determined by that size, and the offset is passed to
>> the child and then included in many comments emitted by the child. If the
>> size is wrong, then the offset is wrong, and then the comments are wrong.
>>
>> So it is clear that repetitive code generation occurs because of the
>> relaxation algorithm. How bad is it? It turns out that the algorithm is
>> about O(1.7^(n-1)), where n is the depth of the pattern matching tree. The
>> depth of the AMDGPU tree is 13. Here are some interesting statistics:
>>
>> Number of pattern operators at depth 11: 35,000
>> Number of times those operators are regenerated: 20 million
>>
>> I think we have found the problem. So what can be done? I tried a quick
>> and dirty experiment. I forced all the child sizes to occupy a 3-byte
>> length, so that the relaxation algorithm was no longer necessary. The
>> results are shown here.
>>
>>
>> ===-------------------------------------------------------------------------===
>> TableGen Phase Timing
>> ===-------------------------------------------------------------------------===
>> Total Execution Time: 90.7302 seconds (90.8242 wall clock)
>>
>> ---User Time---   --System Time--   --User+System--   ---Wall Time---   --- Name ---
>>  69.7324 ( 76.9%)  0.0000 (  0.0%)   69.7324 ( 76.9%)   69.7320 ( 76.8%)  Convert to matchers
>>  14.5705 ( 16.1%)  0.0312 ( 33.3%)   14.6017 ( 16.1%)   14.6958 ( 16.2%)  Parse, build records
>>   3.0576 (  3.4%)  0.0624 ( 66.7%)    3.1200 (  3.4%)    3.1192 (  3.4%)  Emit matcher table
>>   2.1840 (  2.4%)  0.0000 (  0.0%)    2.1840 (  2.4%)    2.1891 (  2.4%)  Sort patterns
>>   1.0920 (  1.2%)  0.0000 (  0.0%)    1.0920 (  1.2%)    1.0831 (  1.2%)  Optimize matchers
>>   0.0000 (  0.0%)  0.0000 (  0.0%)    0.0000 (  0.0%)    0.0050 (  0.0%)  Write output
>>  90.6366 (100.0%)  0.0936 (100.0%)   90.7302 (100.0%)   90.8242 (100.0%)  Total
>>
>> Now the matcher emitter phase is insignificant! Unfortunately, the size
>> of the matcher table increases from 451,430 bytes to 469,612 bytes, an
>> increase of 18,182 bytes or 4%.
>>
>> So one solution is not to care about the 4% increase and always use
>> 3-byte child sizes. A second solution is to add an option that specifies
>> the starting number of child size bytes and retains the relaxation
>> algorithm. When building the system, we would specify
>> --child-size-bytes=1; when building for debugging, we could specify
>> --child-size-bytes=3.
>>
>> Comments and suggestions gratefully accepted.
>>
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> https://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>>