<table border="1" cellspacing="0" cellpadding="8">
<tr>
<th>Issue</th>
<td>
<a href=https://github.com/llvm/llvm-project/issues/92344>92344</a>
</td>
</tr>
<tr>
<th>Summary</th>
<td>
[mlir][python] Tracking of PyOperations references in `liveOperations` map breaks down in reasonable scenarios
</td>
</tr>
<tr>
<th>Labels</th>
<td>
mlir
</td>
</tr>
<tr>
<th>Assignees</th>
<td>
</td>
</tr>
<tr>
<th>Reporter</th>
<td>
christopherbate
</td>
</tr>
</table>
<pre>
Background:
The PyMlirContext has a mechanism for creating unique references to `MlirOperation` objects. When a `PyOperation` object is created, it is interned into a map (together with an owning `py::object`) into a map called `liveOperations` held by the `PyMlirContext`. This allows for various checks (assertions in the C++ code or tests written in Python) to occur to ensure that objects have the appropriate lifetimes. Since MLIR IR has a nested structure, it's possible to create a reference to an Operation in Python and then do something funny (e.g. delete the parent), and then try to use the child op somehow. In C++, printing a freed `Operation` will probably crash, but in Python printing a `PyOperation` whose underlying `Operation*` has been freed should recover gracefully. For reasons like this (I think), the PyOperation retains a `valid` member variable which should be set to `false` in such situations, and most operations in `PyOperation` are guarded by a check to ensure `valid == true`.
This seems to work well, but there are several holes in how lifetimes of PyOperations are tracked (and therefore how they are inserted or cleaned up from the `liveOperations` map). These are mostly documented in `IRCore.cpp` as TODOs.
For example, the following code appears to crash:
```
@run
def testModuleCleanup():
ctx = Context()
module = Module.parse(r"""module @my_module {
func.func @my_func() {
return
}
}""", ctx)
func = module.body.operations[0]
assert ctx._get_live_module_count() == 1
assert ctx._get_live_operation_count() == 1
module = None
gc.collect()
assert ctx._get_live_module_count() == 0
assert ctx._get_live_operation_count() == 1
print(func)
```
The reason is that even though `module` (`PyModule`) was dereferenced and cleaned up, it didn't invalidate its children. Therefore, the `PyOperation` pointed to by `func` still has its valid bit set (and is still in the `liveOperations` map).
Furthermore, even if you don't do anything with `func`, it remaining in `liveOperations` is problematic. If you re-use the same Context, as in the below code, the MLIR C++ API and the underlying allocator may return a pointer that is still in `liveOperations` (since operations may be freed by MLIR but still exist in `liveOperations`, that is a valid state):
```
@run
def testModuleCleanup():
ctx = Context()
module = Module.parse(r"""module @my_module {
func.func @my_func() {
return
}
}""", ctx)
func = module.body.operations[0] # suppose the underlying Operation* == 0x5e89746a58b0
assert ctx._get_live_module_count() == 1
assert ctx._get_live_operation_count() == 1
module = None
gc.collect()
assert ctx._get_live_module_count() == 0
# live operations is still 1, this is expected. In `liveOperations` map,
# PyOperation are tracked by the underlying `Operation*`, which means the key 0x5e89746a58b0
# is still in the map
assert ctx._get_live_operation_count() == 1
# This can result in assertion unpredictably, because MLIR may return new Operation* == 0x5e89746a58b0,
# and the key exists in `liveOperation` map.
module = Module.parse(r"""module @my_module { }""")
```
An assertion will be encountered because of this line:
https://github.com/llvm/llvm-project/blame/main/mlir/lib/Bindings/Python/IRCore.cpp#L1164
Here you can see that even if we amend the assertion to include a check if it is `valid`, proceeding to update the `liveOperations` map may invalidate a reference still held by the Python user. If this is not allowed, one would have to start enforcing that a single `Context` in the Python API cannot create any more operations after deleting any top-level Module (the create module, delete module, create module sequence is problematic here as shown above). IMO the former is better vs the latter (unless there's another way to solve this problem).
In addition, compiling MLIR with CMAKE_BUILD_TYPE=RelWithDebInfo vs CMAKE_BUILD_TYPE=Debug appears to change the likelihood of encountering the assertion in the second example.
</pre>
<img width="1px" height="1px" alt="" src="http://email.email.llvm.org/o/eJzUWEtv4zgS_jXMpTCCItuxfcghsSdYY7u3G729GOwpoKiSxQ1Faviw43-_KJKy5e70YzCLATYIHFmpF-vxVRW5c3KvEe_Z4pEttjc8-M7Ye9FZ6bwZOrQ193hTm-Z0z8otKx8euXjZWxN0w2YP6VX6_NwhfDy9V9JujPb46qHjDjj0KDqupeuhNRaERe6l3kPQ8veAYLFFi1qgA2-A3ZUk4MOAlntpNLsrwdT_QeFdAb91qIETzcfTGxQgXZKODas2IOMLqT1ajQ09GDKGD8CqlTd79B1aOErfAddgjpqMYnflcKJzzR6SUHZXsmo95RZcKWyIUskDnu1wZAh0qBqoT-A7THZO3MHuygI-d9IBV8ocXXTHgVtpggPRoXhxZBp3Dm2UCFJHQRtWPbLqEYRpEIwFj847OFrpPWoi-njyndFkpzdghAiWHlC7YBF8x_3oQ-j4AaNMPgzWDFZyj6Bki1726Ar4p9QC4f273SfYfcrx0-g8NuC8DcIHi8m5rFo6GIxzslZI6pLrgV8iSm-5hrOLLqYC1w2ZoaEx4EyPviPvt0HrE_kAi30BDSr0ydqBW9SeVWvSfeb19kQqgktEopOqATNEgZ05FrDTo--IDwYrdUw9Dq3FFMOrPDpKpWCwpua1OoGw3HXEWAc_MX0i5etMPHbGIQTdoFWnnFAXguqBaMirNaLORrjOBNWARWEOaGFvucA2KHUq4MlYsMgd5YKSL3RMGXNkR0_6JTvEx8K7uNmi51K7ZOCBK9mQ2h77GlPCcQrZsZOiG7XXCA59rsCWK4fEIjW4QETSh5zk2f-9cR7MOfeJ8mtncIuwD9w2GGuCpySfpOZoHrDZls224G0gvcU1qkgHDrGP-HA09gWOqNQYFypijJocHtByBZ1RGA3qzPGS2mDaqY9cZPGWixfKg2qVk8piayxGVt_hKVJJTQWJDZWeUMgJTcIArTX9WOdfI0HPB1atqd7RJfvIZeoEjRGhR-0jJBHv7tPGWCzEMESfOfj8YfvBFTB1AiUCvvJ-UDgGvDUEIpRiERb4MCC3LhUi5e01NhOMpd_0dV7aoNNzg21ElPemCQo3dMAwsGpFyTUKAQAQ_pXiBCOaJYrzv_vIHimSpGLg1iGrVpZVVfodaeZlf3oevywfL0Lopw1aFPSR6egxKZvQWvTB6mtGttxeXtCXUS25TPjXK3OTgtk2211QgysuCc0WjyVbTOQlVCYxxfMe_TNFPB_hWZig_WhiyuTbzPkm21nN9zivXfoPozH_Yy8KYZSi3vRFCP6gjeUPOH_WzAiIrFqlQK3fTrjzhJDwjFpzbEx4ICDvTNh3VAzJWioEUhcb6PiGlB-5gwbPDaaJWHSpydz2G9loVi0JtCO8UFuS3qUGYVHHqkyVPlbT1-A1GBmL1BvCLkJFOt5dCc5TlyAQJ5kJv2rpI35mICHEilS5gX8PIa7KPFjCoD4bFn0jWziZAI1JR2qopZ5Su4yzy8WyfHqLPZdxmEn48rVm6WKTU9hzL0UBu6TC4i9jK3W8x0uhbwiV8lFqVOYYIWd0XRwWxhHl4eNu7M_TLkgDj-DeWOj5KRcv8OxjmzJh6rQ3zWbVysXxZNJ3SFyNuZPWp2QM9YUkCl-l898SmA6QNPMcSOe5xyny_f_A50_h5uTnL4JQuPph1QxcGAaTE22SI9Mx6QxRrwtcrZfzO75Y1T_Cqx-j8Z9Fur8MkC8P1QyI62rWGgvlNmWwjK_wdUDhsYlT77cBZzPKJsnToXE6D-UF5rtjLIlKI2SPXLvI8IKnb4WM1H0Ji2TQ_yQwV1rSiiU4zcEuqFj-550Kgh4sNlJ4GvHjEImCE-xF5Jigk8bjz6TkxaFJ-wh-5IqIPu4t-MnhyOD_Z4r_iyodk-87Tfhh6o648dQIqKOD0VL0s0dMm5JLSY1nKOu8Hxx9q55Y9bSXvgt1IUzPqielDuOfXwZr4vZcPdWK98iqJ2pK9EdJS0SyZtXTo9SN1HvHqqdxhX2ajMPV7N3t7d18avzfaNqnZkXxdYiTIUK2cETgPeYIXA7pDUgtVKAxOe8gss33A5MNiWI5WCMQyai4WQ5xdvheC48pM5kzputvnhQmVwJ5hQwObWy7Y_Fq49OlQLq6MBrhGPeytK8b6kvWA-rWWBGNo2NzcFLvVTTucscwVlfWRR1ZcE0axg1dn4BGjCmm8Jb6cNy4Y7_WtFkPvyg8oMo5Ga9NaM1OUvKkVm3GPf3y4ooCHP4eojuuxw5Ii5ujDfSogdfmgHFj2r3_kBcc26Mlrho9WXdIIKN4_MaqVdAKnUtrW7yO4NqkSx0eLwacUYe8M2fFXw5cOw28aWQq8Q0I0w9SkQMiGsT5avP-4e-_Pj_-a_du-_z53x9_ZbPtJ1S_Sd9tsd7p1pBdbxBtsQ77q7Ws43qfkol2eSU7YxoqsnPtpcBOMzeH0qEwuhk3wKsTpM-b5n7WrGdrfoP3t8vbxWpWLVbrm-5-sVjO1mVdrZa8xFLMmzu-QKz5bLlsF8tbvJH3VVnNy8XtXTkv1_N1McdqOZvNZ_V6Ua0XsxmblzRQqoLqujB2fyOdC3i_rmbz-Y3iNSoXLw-rKtV2xRbbG3sfYaAOe8fmpSIcvAjw0qt44RgZFlu2eBxS-S-28Jl6EHniy4V9clH4rQmR6rG2yF8cNJRUUud1I154OIGaW2ncTbDq_g8DWTw2QVU8-X8DAAD__yBi28w">