[Mlir-commits] [mlir] Reimplementing target description concept using DLTI attribute (PR #92138)

Wed Jun 5 03:50:17 PDT 2024

rengolin wrote:

> My question is not about the current PR, but instead about how would one represent those systems with the boiler plate infra. Because from what I understood, right now it's a flat representation, so it works well for a single heterogeneous node, but more complex systems are not straightforward, and we should get that right from the start.

This is the next step. We have two proposals:
1. **Expand the dictionary representation**. We can make `#dlti.dl_entry` contain another entry and cascade the nested target information, add arbitrary search inside the tree, and have targets of targets represented in a single `#dlti.dl_entry` (for example, `getTarget("GPU", "A100", "3")` would get the third node in the "A100" sub-entry below the "GPU" top-entry).
2. **Add a generic entry type**. This means `#dlti.dl_entry` accepts a _"descriptor"_, which is an element in a compiler map that can access target information. This allows us to use LLVM's TTI or other downstream target descriptors that the IR knows nothing about (and doesn't need to).
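To make the first proposal concrete, here is a rough sketch of the nested lookup in plain Python. This is not MLIR code: nested dicts stand in for cascaded `#dlti.dl_entry` attributes, `get_target` is a hypothetical helper mirroring the `getTarget("GPU", "A100", "3")` example, and the property names and values are placeholders.

```python
# Hypothetical model: nested dicts stand in for cascaded `#dlti.dl_entry`
# attributes. Keys, property names and values are placeholders, not MLIR API.
system = {
    "GPU": {
        "A100": {
            "1": {"memory_gb": 40},
            "2": {"memory_gb": 40},
            "3": {"memory_gb": 80},
        },
    },
    "CPU": {
        "0": {"l1_cache_kb": 32},
    },
}


def get_target(tree, *path):
    """Walk the tree one key per level; return None if any key is missing."""
    node = tree
    for key in path:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Third node in the "A100" sub-entry below the "GPU" top-entry.
third_a100 = get_target(system, "GPU", "A100", "3")
# A path that does not exist simply yields None.
missing = get_target(system, "GPU", "H100")
```

The same walk works at any depth, so a "target of targets" is just a longer path.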

I believe we need both extensions. Both derive directly from this PR and can be extended independently, so we really should not include them in this PR. Once we agree on the base structure, we'll continue with the more advanced strategies.

> My comment is about the case: `system1 = [deviceX[id = 0], deviceX[id = 1]]`, `system2 = [deviceX[id = 1], deviceX[id = 0]]`. In many cases `system1 == system2`, however, they are not the same by forcing ids. Moreover, it's also fragile, because a pass that works on `system1` might not work on `system2` because of the ordering.

This is only a problem if you don't know what device you're running on and if you have more than one "system", which, in the current PR design, is not reasonable. In the current design, _"system"_ is your **entire** system, including all devices in all nodes. Technically, having multiple systems with multiple devices is identical to having a single system with the same devices. Accessing them is slightly different, but equally unique (like array indexing `system[0][1]` vs `system0[1]`).

> Also, why do you need the ID to be stored in the device to query the properties? As far as I could tell, devices are placed in an array, so `attr.getEntries()[devId]` should work.

This means the order in which they appear in the IR is now relevant. It's a different design (it means we need to sort maps before printing them out) but equally unique. In your multi-system example, `deviceX[id = 0]` is always the same device, regardless of the order in which it appears in the list, and using some `getDeviceX(0)` would always return the one with `id = 0`, so there's no ambiguity.
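A small sketch of the difference, in plain Python rather than MLIR. The `Device` class and `get_device` helper are hypothetical stand-ins for the device attribute and for something like `getDeviceX(0)`, and the property values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for a device entry with an explicit ID."""
    kind: str
    dev_id: int
    l1_cache_kb: int  # placeholder property


def get_device(system, kind, dev_id):
    """Look a device up by its explicit ID, ignoring its position."""
    for device in system:
        if device.kind == kind and device.dev_id == dev_id:
            return device
    return None


dev0 = Device("deviceX", 0, l1_cache_kb=32)
dev1 = Device("deviceX", 1, l1_cache_kb=64)

system1 = [dev0, dev1]
system2 = [dev1, dev0]  # same devices, different order in the IR

# Positional access depends on the order the devices were written in.
positional_agrees = system1[0] == system2[0]
# ID-based access returns the same device for both orderings.
id_lookup_agrees = get_device(system1, "deviceX", 0) == get_device(system2, "deviceX", 0)
```

With explicit IDs, a pass that asks for device 0 sees the same properties whichever way the list was printed.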

The reason we use explicit (numeric) IDs in this PR is that we also want to have text IDs but haven't got around to implementing them.

This would help in many ways, for example:
* Generically naming the targets (ex. `CPU`, `GPU`) for simple systems.
* Using some device ID that is relevant to the driver (ex. `0xDEADBEEF`) to easily fetch properties via API calls.
* Encoding information in the device ID (ex. `IPU[0][1]`), which could lead you to the second tile on the first device.
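As an illustration of the last point only, a text ID such as `IPU[0][1]` could be split into a name and a list of indices. The syntax and the `parse_device_id` helper below are assumptions for the sake of the example; nothing like this is part of the PR.

```python
import re

# Assumed syntax: an identifier followed by zero or more "[<number>]" parts.
_DEVICE_ID = re.compile(r"([A-Za-z_]\w*)((?:\[\d+\])*)")


def parse_device_id(text):
    """Split a text ID like "IPU[0][1]" into ("IPU", (0, 1))."""
    match = _DEVICE_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a recognised device ID: {text!r}")
    name, suffix = match.groups()
    indices = tuple(int(i) for i in re.findall(r"\[(\d+)\]", suffix))
    return name, indices


name, indices = parse_device_id("IPU[0][1]")  # first device, second tile
plain_name, no_indices = parse_device_id("GPU")  # generic name, no indices
```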

> My issue here is that this doesn't scale. Instead, the generic device attribute should only use a key lookup to find the value. If someone wants to create a more specialized attribute that contains L1, L2... they should be the ones adding those methods.

Correct. This is part of the two solutions I mention above. This is a planned extension, but we need to agree on the base structure now, which is the single role of this PR. The attributes encoded so far are only _examples_ of what can be done.

We don't plan to expand this to every possible combination of target attributes out there. We plan to have those for building the system's hierarchy, and then use the generic entry type for target-specific ones.
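Roughly, the split between a generic attribute and a specialised one could look like the sketch below. It is plain Python, not the MLIR attribute classes: `DeviceSpec`, `CacheAwareDeviceSpec` and the key names are all hypothetical.

```python
class DeviceSpec:
    """Generic device description: a key lookup and nothing else."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def get(self, key, default=None):
        """Return the value stored under `key`, or `default` if absent."""
        return self._entries.get(key, default)


class CacheAwareDeviceSpec(DeviceSpec):
    """Specialised description; whoever needs L1/L2 adds these methods."""

    def l1_cache_bytes(self):
        return self.get("l1_cache_bytes")

    def l2_cache_bytes(self):
        return self.get("l2_cache_bytes")


# Placeholder values, for illustration only.
cpu = CacheAwareDeviceSpec({"l1_cache_bytes": 32768, "vendor_tag": "example"})
l1 = cpu.l1_cache_bytes()
l2 = cpu.l2_cache_bytes()      # key not present, so None
vendor = cpu.get("vendor_tag")  # generic lookup still works
```

The generic class never grows per-target methods; anything target-specific lives in the specialisation or behind the generic entry type.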

In summary, the answer to all your questions is simply: Yes, we agree with you, but this is a long walk and we've just started. We need to agree on the usage of DLTI and the base strategy for the system descriptor first. If we do, then we continue implementing the rest in following PRs.

However, some questions will only be answered when we start using the infra, and this is why we're doing this step-wise. There's an issue from Intel's graph-compiler that links here, and that's one of the users. We hope others will have equivalent work, and we need to evolve together.

It would be wrong to come up with a fixed _"generic"_ design right now, trying to predict all cases, and having no one actually using any of those. It's much simpler and faster to agree on a base design and improve from there, with the input of all users, not a single one.

https://github.com/llvm/llvm-project/pull/92138