[PATCH] D80713: [AMDGPU] Support disassembly for AMDGPU kernel descriptors

Scott Linder via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Wed Jul 1 11:21:01 PDT 2020


scott.linder added a comment.

In D80713#2110898 <https://reviews.llvm.org/D80713#2110898>, @jhenderson wrote:

> In D80713#2108650 <https://reviews.llvm.org/D80713#2108650>, @rochauha wrote:
>
> > In D80713#2106033 <https://reviews.llvm.org/D80713#2106033>, @jhenderson wrote:
> >
> > > No idea about the functional logic of this code, but I do wonder whether you'd be better off dividing it into smaller patches for testability.
> >
> >
> > This patch is for entirely disassembling the kernel descriptor symbol. It happens to be so that many directives affect the kernel descriptor. I will also be adding a new test in this patch.
>
>
> The problem is that you're trying to do it all at once. There's no need to implement the full disassembly at once. You can do it in a series of patches, one after the other that build towards the primary goal. For example, there are many `if (<X is some value>)` type cases, which you could omit - just assume those are all false, in earlier versions, and add them in (with corresponding testing) in a later patch. Similarly, you could assume that all results of `x & y` return some specific value, e.g. 0, and just print that for now. Yes, that means you won't support everything from the point at which this patch lands, but it will make each individual patch easier to reason with. This fits much better with LLVM's preferred approach - please see https://llvm.org/docs/DeveloperPolicy.html#incremental-development.


I'm curious if this is the intent of the document you linked, though? It says "The remaining inter-related work should be decomposed into unrelated sets of changes if possible.", but the disassembly of this directive is not decomposable into unrelated changes; if only part of the descriptor is disassembled then the whole `.amdhsa_kernel` block is invalid. I suppose in the interest of review one could break the logical/atomic series of commits down further into inter-dependent patches that must be applied together? I think that causes confusion when doing git-bisect etc., so it doesn't seem ideal. Or maybe the patches should be squashed back together after review but before they are committed?

I do think there needs to be more direct testing of the branches taken in the code. If decomposing the patch beyond the logical/atomic decomposition makes it easier to do that as an author and/or a reviewer then I am OK with it. In this case, though, a lot of the length in terms of SLOC is just repetition and a little redundancy, so breaking it up further would only hide that and make it harder to see where things are best removed or factored out.



================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1227
+
+  // We cannot accurately backward compute #VGPRs used from
+  // GRANULATED_WORKITEM_VGPR_COUNT. So we 'decode' as Code Object V3 predefined
----------------
This doesn't seem right. This symbol is not accurate without careful attention from an assembly author, so it will surely be incorrect when used by a disassembler.

We //can// accurately compute a VGPR count from the `GRANULATED_WORKITEM_VGPR_COUNT`; we just calculate e.g. `(GRANULATED_WORKITEM_VGPR_COUNT + 1) * granularity`, where the `+ 1` accounts for the minimum allocation (i.e. a granulated count encoded as "0" actually indicates a 1-granule allocation) and where the scaling by granularity is device-specific. This calculates the greatest value of `.amdhsa_next_free_vgpr` which will produce the same granulated count, but it would be equally valid to calculate the minimum, or any value in between. All we care about is producing disassembly which results in the same descriptor. (Caveat: I would prove this to yourself by referencing https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-compute-pgm-rsrc1-gfx6-gfx10-table as I may have the actual calculation wrong; we just want to calculate the inverse of what the assembler does.)

Your point that we can't recover the exact value the original author intended to place in the assembly text is true, but that is OK as long as we can always give them a valid input to the assembler which gives them the same output.
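To make the inverse concrete, here is a minimal sketch of the idea (the granularity value and function names are illustrative, not the actual implementation; the real encoding should be checked against the AMDGPUUsage table):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical inverse of the assembler's granulation step: recover one
// valid .amdhsa_next_free_vgpr value from the encoded
// GRANULATED_WORKITEM_VGPR_COUNT field. An encoded 0 denotes a single
// granule (the minimum allocation), hence the "+ 1". This yields the
// largest register count mapping back to the same encoded field; any
// count within the same granule would be equally valid.
uint32_t nextFreeVGPR(uint32_t GranulatedCount, uint32_t Granularity) {
  return (GranulatedCount + 1) * Granularity;
}

// Forward direction (what the assembler does), for checking round-trip:
// round the requested count up to whole granules, then subtract the
// implicit minimum granule.
uint32_t granulate(uint32_t NextFreeVGPR, uint32_t Granularity) {
  return (NextFreeVGPR + Granularity - 1) / Granularity - 1;
}
```

The round-trip property we care about is `granulate(nextFreeVGPR(n, g), g) == n`, i.e. the disassembled directive re-encodes to the same field.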


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1232
+           << ".amdgcn.next_free_vgpr"
+           << "\n";
+
----------------
This can be a char literal, i.e. `'\n'`; same applies elsewhere.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1245
+  //
+  // To to get the exact same bytes in re-assembled binary, we disassemble
+  // aamdhsa_next_free_sgpr as the amdgcn.next_free_sgpr assembler symbol and
----------------
For this and the above case we should have tests to prove this out. I.e. assemble sources to a binary, disassemble and reassemble it, and then compare the two binaries. Ideally we would do this for some edge cases around VGPR/SGPR allocation granularity.

There may need to be some fixup between disassembly and reassembly to account for the remaining non-reassembleable bits produced by llvm-objdump, but they should be pretty minor for a trivial kernel, and I would expect you could handle them with just `sed` which seems to be available to LIT tests.
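For example, the test could take roughly this shape (the tool names are real, but the exact flags, sed fixups, and section choice are illustrative and would need adjusting):

```
; RUN: llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx900 -filetype=obj %s -o %t1.o
; RUN: llvm-objdump --disassemble-symbols=my_kernel.kd %t1.o > %t1.s
;; Keep only the reassembleable directive block, dropping objdump's headers.
; RUN: sed -n '/\.amdhsa_kernel/,/\.end_amdhsa_kernel/p' %t1.s > %t2.s
; RUN: llvm-mc -triple=amdgcn-amd-amdhsa -mcpu=gfx900 -filetype=obj %t2.s -o %t2.o
;; Compare just the descriptor bytes of the two objects.
; RUN: llvm-objcopy -O binary --only-section=.rodata %t1.o %t1.bin
; RUN: llvm-objcopy -O binary --only-section=.rodata %t2.o %t2.bin
; RUN: cmp %t1.bin %t2.bin
```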


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1254
+  //
+  // The disassembler cannot recover the original values of those directives.
+  KdStream << Indent << ".amdhsa_reserve_vcc " << 0 << "\n";
----------------
Just as with the VGPR count, we cannot use the symbol to define this, and we //can// compute a (non-unique) input which produces the same output.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1264
+  if ((FourByteBuffer & amdhsa::COMPUTE_PGM_RSRC1_PRIORITY) >>
+      amdhsa::COMPUTE_PGM_RSRC1_PRIORITY_SHIFT) {
+    return MCDisassembler::Fail;
----------------
This shift isn't necessary, you just need to check for the presence of any set bits.

I also noticed when checking the types here that for some reason we declare the enum for these masks/shifts as `int32_t`, which makes one have to think about both integer promotion rules and then possibly which bitwise-operations are valid for signed integers. I think here the signed mask is promoted to unsigned, and you get what you want, but it may be good to go fix the definition of the masks separately.
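A small illustration of the first point (the mask and shift values here are made up for the example, not the real field definitions):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative mask covering bits 11:10 and its shift.
constexpr uint32_t PRIORITY = 0x00000C00;
constexpr uint32_t PRIORITY_SHIFT = 10;

// With the shift: extract the field, then test it.
bool fieldIsNonZeroShifted(uint32_t X) {
  return ((X & PRIORITY) >> PRIORITY_SHIFT) != 0;
}

// Without the shift: (X & PRIORITY) is nonzero exactly when the shifted
// field is nonzero, so a bail-out-if-set check needs no shift at all.
// (This uses uint32_t throughout, sidestepping the signed-mask promotion
// question raised above.)
bool fieldIsNonZero(uint32_t X) { return (X & PRIORITY) != 0; }
```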


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1269
+           << ((FourByteBuffer &
+                amdhsa::COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_32) >>
+               amdhsa::COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_32_SHIFT)
----------------
I think a short-lived well-defined preprocessor macro could make the patch shorter and easier to read. For example:

```
#define PRINT_TRIVIAL_FIELD(DIRECTIVE, MASK)                                   \
  do {                                                                         \
    KdStream << Indent << DIRECTIVE " "                                        \
             << ((FourByteBuffer & amdhsa::MASK) >> amdhsa::MASK##_SHIFT)      \
             << '\n';                                                          \
  } while (0)
PRINT_TRIVIAL_FIELD(".amdhsa_float_round_mode_32", COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_32);
PRINT_TRIVIAL_FIELD(".amdhsa_float_round_mode_16_64", COMPUTE_PGM_RSRC1_FLOAT_ROUND_MODE_16_64);
// ...
#undef PRINT_TRIVIAL_FIELD
```

I think a macro fits here in part because the masks themselves are already defined in terms of preprocessor macros.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1290
+  if ((FourByteBuffer & amdhsa::COMPUTE_PGM_RSRC1_PRIV) >>
+      amdhsa::COMPUTE_PGM_RSRC1_PRIV_SHIFT) {
+    return MCDisassembler::Fail;
----------------
Unnecessary shift


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1301
+  if ((FourByteBuffer & amdhsa::COMPUTE_PGM_RSRC1_DEBUG_MODE) >>
+      amdhsa::COMPUTE_PGM_RSRC1_DEBUG_MODE_SHIFT) {
+    return MCDisassembler::Fail;
----------------
Unnecessary shift, same for all other cases below when checking that the bits under a mask are 0.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1345
+  return MCDisassembler::Success;
+} // decodeCOMPUTE_PGM_RSRC1()
+
----------------
I don't know the general conventions here, but I don't think I have seen a comment for the end of a function elsewhere in LLVM. I do know that it is required for namespaces, so maybe it is permitted for long functions?


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1387
+
+  // 13th bit (ENABLE_EXCEPTION_ADDRESS_WATCH) must be 0.
+  if ((FourByteBuffer &
----------------
All of these comments seem redundant to me, especially when the condition is simplified to just:

```
if (Buffer & MASK) return Fail;
```

At the very least, repeating the actual bit indices here when they are a part of the mask definition seems verbose.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1472
+  DataExtractor DE(Bytes, /*IsLittleEndian=*/true, /*AddressSize=*/8);
+  // When we fail, we set:
+  //    Size = current cursor position (starting point of the chunk of bytes)
----------------
I don't understand the intent with carefully maintaining `Size` for the failure case. Aren't we certain by this point that this //should// be a kernel descriptor, and so the correct thing to do when we fail is to disassemble everything as `.byte` directives (i.e. set `Size` to the max value)? Why would we prefer to return a partial failure and have the disassembler start working on the remaining bytes as if they were instructions?

That would also shorten the patch and make it obvious that the `Size` tracking is correct.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1484
+
+  case 4: // 0 + 4
+    FourByteBuffer = DE.getU32(Cursor);
----------------
Rather than have these comments, which are still just filled with magic numbers, could we define these offsets more explicitly somewhere, as e.g. `amdhsa::GROUP_SEGMENT_FIXED_SIZE_OFFSET`? For example in `llvm/include/llvm/Support/AMDHSAKernelDescriptor.h` alongside the other definitions needed by the compiler? We could then also update the `static_assert`s there to use those definitions, so we aren't relying on inspection to know the same offsets are used everywhere. I.e. the following:

```
static_assert(
    sizeof(kernel_descriptor_t) == 64,
    "invalid size for kernel_descriptor_t");
static_assert(
    offsetof(kernel_descriptor_t, group_segment_fixed_size) == 0,
    "invalid offset for group_segment_fixed_size");
static_assert(
    offsetof(kernel_descriptor_t, private_segment_fixed_size) == 4,
    "invalid offset for private_segment_fixed_size");
static_assert(
    offsetof(kernel_descriptor_t, reserved0) == 8,
    "invalid offset for reserved0");
...
```

Becomes:

```
static_assert(
    sizeof(kernel_descriptor_t) == 64,
    "invalid size for kernel_descriptor_t");
static_assert(
    offsetof(kernel_descriptor_t, group_segment_fixed_size) ==
        GROUP_SEGMENT_FIXED_SIZE_OFFSET,
    "invalid offset for group_segment_fixed_size");
static_assert(
    offsetof(kernel_descriptor_t, private_segment_fixed_size) ==
        PRIVATE_SEGMENT_FIXED_SIZE_OFFSET,
    "invalid offset for private_segment_fixed_size");
static_assert(
    offsetof(kernel_descriptor_t, reserved0) == RESERVED0_OFFSET,
    "invalid offset for reserved0");
...
```
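The named offsets themselves might look like this (these names are suggestions and do not exist in the tree; the values come from the existing static_asserts):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical additions to AMDHSAKernelDescriptor.h: named byte offsets
// into kernel_descriptor_t, shared by the static_asserts and by the
// disassembler's switch over the DataExtractor cursor position.
enum : uint32_t {
  GROUP_SEGMENT_FIXED_SIZE_OFFSET = 0,
  PRIVATE_SEGMENT_FIXED_SIZE_OFFSET = 4,
  RESERVED0_OFFSET = 8,
};
```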


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1630
+  // CP microcode requires the kernel descriptor to be 64 aligned.
+  if (Bytes.size() != 64 || KdAddress % 64 != 0)
+    return MCDisassembler::Fail;
----------------
The `!= 0` here is redundant.


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1635
+  raw_string_ostream KdStream(Kd);
+  KdStream << ".amdhsa_kernel " << KdName.drop_back(3).str() << "\n";
+
----------------
Could you move the call to `.drop_back(3)` out into the calling function, so it appears next to the check for `.endswith(StringRef(".kd"))`?
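Roughly like this, using `std::string` as a stand-in for `StringRef` (the function name is invented for the example): keeping the suffix check and the strip together makes the `3` visibly tied to the length of `".kd"`.

```cpp
#include <cassert>
#include <string>

// Check for and remove a trailing ".kd" suffix in one place, so the
// caller never passes an unstripped name to the decoder and the magic
// length never appears on its own.
bool consumeKdSuffix(std::string &Name) {
  const std::string Suffix = ".kd";
  if (Name.size() < Suffix.size() ||
      Name.compare(Name.size() - Suffix.size(), Suffix.size(), Suffix) != 0)
    return false;
  Name.resize(Name.size() - Suffix.size());
  return true;
}
```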


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.cpp:1665
+    Size = 256;
+    return MCDisassembler::SoftFail;
+  }
----------------
I'm still not sure what we landed on for the semantics of `SoftFail` here?


================
Comment at: llvm/lib/Target/AMDGPU/Disassembler/AMDGPUDisassembler.h:83
+
+  DecodeStatus decodeCOMPUTE_PGM_RSRC1(uint32_t FourByteBuffer,
+                                       raw_string_ostream &KdStream) const;
----------------
Could you either rename this to satisfy the linter, or else explicitly suppress the lint with a comment like:

```
// NOLINTNEXTLINE(readability-identifier-naming)
```


================
Comment at: llvm/test/tools/llvm-objdump/ELF/AMDGPU/code-object-v3.ll:48
+; CHECK-NEXT: end_amdhsa_kernel
+define amdgpu_kernel void @my_kernel() {
+  ret void
----------------
I think we need more tests to ensure:

* The edge cases of the granularity calculation are correct
* That our promise of round-trip is fulfilled for at least some representative cases
* That some of the failure modes are handled how we expect


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80713/new/

https://reviews.llvm.org/D80713




