<html>

    <head>

      <base href="https://bugs.llvm.org/">

    </head>

    <body><table border="1" cellspacing="0" cellpadding="8">

        <tr>

          <th>Bug ID</th>

          <td><a class="bz_bug_link 

          bz_status_NEW "

   title="NEW - Poor present table performance"

   href="https://bugs.llvm.org/show_bug.cgi?id=46107">46107</a>

          </td>

        </tr>

        <tr>

          <th>Summary</th>

          <td>Poor present table performance

          </td>

        </tr>

        <tr>

          <th>Product</th>

          <td>OpenMP

          </td>

        </tr>

        <tr>

          <th>Version</th>

          <td>unspecified

          </td>

        </tr>

        <tr>

          <th>Hardware</th>

          <td>PC

          </td>

        </tr>

        <tr>

          <th>OS</th>

          <td>Linux

          </td>

        </tr>

        <tr>

          <th>Status</th>

          <td>NEW

          </td>

        </tr>

        <tr>

          <th>Severity</th>

          <td>enhancement

          </td>

        </tr>

        <tr>

          <th>Priority</th>

          <td>P

          </td>

        </tr>

        <tr>

          <th>Component</th>

          <td>Runtime Library

          </td>

        </tr>

        <tr>

          <th>Assignee</th>

          <td>unassignedbugs@nondot.org

          </td>

        </tr>

        <tr>

          <th>Reporter</th>

          <td>csdaley@lbl.gov

          </td>

        </tr>

        <tr>

          <th>CC</th>

          <td>llvm-bugs@lists.llvm.org

          </td>

        </tr></table>

      <p>

        <div>

        <pre>Created <span class=""><a href="attachment.cgi?id=23545" name="attach_23545" title="The benchmark that reveals slow present table performance">attachment 23545</a> <a href="attachment.cgi?id=23545&amp;action=edit" title="The benchmark that reveals slow present table performance">[details]</a></span>

The benchmark that reveals slow present table performance

It takes a long time to add new entries to the OpenMP present table and a long

time to access pre-existing entries. I have attached a benchmark that captures

the data management requirements of the HPGMG mini-app. This benchmark shows

that adding new entries with [:0] takes a long time (see "Device init" code

section) and retrieving data takes a long time (see "target update from" code

section).

I have two code versions: code version #1 uses [:0] to attach device pointers,

code version #2 manually attaches device pointers in OpenMP target regions. I

have tested 4 configurations using LLVM/Clang-11 from Apr 9 2020.

Configurations 1 and 2 run code versions #1 and #2 with a small present table:

the effective bandwidth of the "target update from" directive is 3.6 and 3.8

GB/s, respectively. Configurations 3 and 4 run the same two code versions in

the case where the present table can be large (num_blocks=10000,

block_size=100 instead of num_blocks=100, block_size=10000): the effective

bandwidth in configuration 3 is only 0.2 GB/s! The workaround code in

configuration 4 achieves 3.8 GB/s. However, initializing the data structure on

the device takes more than 20x longer in both configurations 3 and 4 than in

configurations 1 and 2, even though the total problem size is identical for

all 4 configurations. The full data is:

+ clang -Wall -Werror -Ofast -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda

slow-present-table.c -o slow-present-table

Small present table configuration. Code versions #1 and #2 achieve 3.6 and 3.8

GB/s, respectively.

+ srun -n 1 ./slow-present-table 100 100 10000 1

num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB,

code_version=1

Host init time=0.244111 seconds

Device init time=0.747118 seconds

Device kernel time=0.012566 seconds

Transfers=100, Time=0.208275 seconds, Data=0.745058 GB, Bandwidth=3.577283 GB/s

SUCCESS

+ srun -n 1 ./slow-present-table 100 100 10000 2

num_levels=100, num_blocks=100, block_size=10000, mem=0.745058 GB,

code_version=2

Host init time=0.228604 seconds

Device init time=0.568662 seconds

Device kernel time=0.013114 seconds

Transfers=100, Time=0.198643 seconds, Data=0.745058 GB, Bandwidth=3.750740 GB/s

SUCCESS

Large present table configuration. Code version #1 is an order of magnitude

slower than code version #2 for the data transfer (0.24 vs 3.8 GB/s)!

+ srun -n 1 ./slow-present-table 100 10000 100 1

num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB,

code_version=1

Host init time=0.234948 seconds

Device init time=38.181732 seconds

Device kernel time=0.058576 seconds

Transfers=100, Time=3.147222 seconds, Data=0.745058 GB, Bandwidth=0.236735 GB/s

SUCCESS

+ srun -n 1 ./slow-present-table 100 10000 100 2

num_levels=100, num_blocks=10000, block_size=100, mem=0.745058 GB,

code_version=2

Host init time=0.236912 seconds

Device init time=20.666237 seconds

Device kernel time=0.056635 seconds

Transfers=100, Time=0.197634 seconds, Data=0.745058 GB, Bandwidth=3.769888 GB/s

SUCCESS
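
As a cross-check of the figures above, the following short Python sketch
recomputes the reported memory size, the effective bandwidths, and the
slowdown factors from the logged times. It is not part of the attached
benchmark. It assumes 8-byte elements and that "GB" means 2**30 bytes; both
are inferred from the printed mem=0.745058 GB and are not stated in the log.

```python
# Cross-check of the figures reported in the log above.
# Assumption: each element is an 8-byte double and "GB" means 2**30 bytes.
BYTES_PER_ELEMENT = 8
GIB = 2**30

def mem_gib(num_levels, num_blocks, block_size):
    """Total data size in GiB for one run."""
    return num_levels * num_blocks * block_size * BYTES_PER_ELEMENT / GIB

# (label, device init seconds, "target update from" seconds), copied from the log.
runs = [
    ("small table, version 1", 0.747118, 0.208275),
    ("small table, version 2", 0.568662, 0.198643),
    ("large table, version 1", 38.181732, 3.147222),
    ("large table, version 2", 20.666237, 0.197634),
]

data = mem_gib(100, 100, 10000)  # identical to mem_gib(100, 10000, 100)
for label, init_s, xfer_s in runs:
    print(f"{label}: init={init_s:.2f} s, bandwidth={data / xfer_s:.2f} GB/s")

# Slowdown of the large-table runs relative to the small-table runs.
init_slowdown_v1 = runs[2][1] / runs[0][1]
init_slowdown_v2 = runs[3][1] / runs[1][1]
xfer_slowdown_v1 = runs[2][2] / runs[0][2]
print(f"init slowdown: v1={init_slowdown_v1:.0f}x, v2={init_slowdown_v2:.0f}x")
print(f"transfer slowdown for v1: {xfer_slowdown_v1:.0f}x")
```

Under these assumptions the recomputed bandwidths agree with the logged values
to within rounding of the printed times.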

From a personal discussion with Johannes Doerfert: "I looked at the code we run

in the slow version (without profiling). I suspect the problem is that we have 3

entries for each mapped "ptr[:0]" in the Device.ShadowPtrMap. In the other

version we have one entry *temporarily* in there. At some point, I suspect,

this std::map becomes large and dealing with it slows down. It is unclear if we

need these mappings really or not. If so, we could potentially investigate a

more scalable data structure. I can imagine the init is slow because the map is

built, and the update is slow because we iterate the map for each update. Maybe

there is also some overhead we introduce by going through a few dynamic

libraries trying to allocate and copy 0 bytes of data for the map with the

empty array section."</pre>
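
        <pre>To illustrate the kind of scaling suspected in the quote above, here is a toy
Python model. It is not libomptarget code and makes no claim about how
Device.ShadowPtrMap is actually traversed. The table layout and the helper
names (build_table, find_linear, find_sorted) are hypothetical. It only
contrasts visiting every entry on each lookup with a binary search over sorted
start addresses.

```python
# Toy model only: this is not libomptarget code. It contrasts scanning every
# table entry on each lookup with a binary search over sorted start addresses.
import bisect

def build_table(num_entries, size):
    """Hypothetical table of non-overlapping (start, end) address ranges."""
    return [(i * 2 * size, i * 2 * size + size) for i in range(num_entries)]

def find_linear(table, addr):
    """Visit entries one by one; returns (index, entries_visited)."""
    for visited, (start, end) in enumerate(table, 1):
        if end > addr >= start:
            return visited - 1, visited
    return None, len(table)

def find_sorted(starts, table, addr):
    """Binary search on sorted start addresses; returns the index or None."""
    i = bisect.bisect_right(starts, addr) - 1
    if i >= 0 and table[i][1] > addr:
        return i
    return None

table = build_table(num_entries=10000, size=100)
starts = [start for start, _ in table]
addr = table[-1][0] + 5  # an address inside the last range
index, visited = find_linear(table, addr)
print(index, visited, find_sorted(starts, table, addr))
```

In this toy setup the scan visits all 10000 entries to reach the last range,
while the binary search needs on the order of log2(10000) steps, roughly 14.
Whether the real runtime behaves this way would need profiling.</pre>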

        </div>

      </p>


    </body>

</html>