[PATCH] D158675: [NFC][AsmPrinter] Refactor DbgVariable as a std::variant

Wed Sep 6 11:55:51 PDT 2023

scott.linder added inline comments.

================
Comment at: llvm/lib/CodeGen/AsmPrinter/DwarfDebug.h:149
+struct MMILoc {
+  mutable SmallVector<FrameIndexExpr, 1> FrameIndexExprs;
+};
----------------
fdeazeve wrote:
> scott.linder wrote:
> > fdeazeve wrote:
> > > Not for this patch, but it really bothers me that this is mutable. The original rationale must have been to allow sorting when we call the "get" method, but still...
> > It is almost spooky how you've pointed out the same things I toyed with while working on the patches originally!
> > 
> > I had a patch in the middle somewhere which just exposed the need to sort in the interface, but scrapped it early on. I think your `set` idea makes a lot of sense and I'll follow up with a patch for that as well.
> One thing to consider here: sorting (either in the current vector approach or through the internal impl of std::set) require calling `Expr->getFragmentInfo()->OffsetInBits` as many time as comparison operations are performed, and this is possibly done multiple times for the same expression without any caching.
> 
> Something to keep in mind as a possible improvement as well (perhaps computing & storing the offset in advance)
I tried to do a bit of (non-statistically-rigorous) benchmarking of a few approaches.

The versions are:

* main: unmodified main branch
* patched+uncached: most straightforward switch to std::set, without any caching of the fragment info
* patched+eager_cached: above, but assume the `OffsetInBits` will be needed, and cache it in the constructor of `FrameIndexExpr`
* patched+lazy_cached: above, but re-introduce a `mutable` member so the comparison operator can cache the `OffsetInBits` on first use

The input is a precodegen dump of llvm-tblgen from a FullLTO build of main. Not sure if this is really representative, but it at least contains multi-fragment MMI entries.

If there is a better way to do the benchmark I am happy to redo it, this is just the best I could come up with that would take a reasonable amount of time.

```
llc -O3 build-lto/bin/llvm-tblgen.0.5.precodegen.bc -o /dev/null -filetype=obj -time-passes

 ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---

main:
22.3635 ( 28.8%)  14.2489 ( 71.7%)  36.6124 ( 37.5%)  36.6142 ( 37.5%)  X86 Assembly Printer
21.9819 ( 28.1%)  14.6388 ( 71.8%)  36.6207 ( 37.2%)  36.6238 ( 37.2%)  X86 Assembly Printer

patched+uncached:
21.3484 ( 28.0%)  14.4208 ( 72.4%)  35.7693 ( 37.2%)  35.7706 ( 37.2%)  X86 Assembly Printer
21.4563 ( 28.2%)  14.4606 ( 72.2%)  35.9169 ( 37.4%)  35.9191 ( 37.4%)  X86 Assembly Printer

patched+eager_cached:
21.2332 ( 32.0%)  14.3204 ( 76.2%)  35.5536 ( 41.8%)  35.5555 ( 41.8%)  X86 Assembly Printer
21.6691 ( 31.7%)  14.5448 ( 74.4%)  36.2139 ( 41.2%)  36.2149 ( 41.2%)  X86 Assembly Printer

patched+lazy_cached:
22.4542 ( 27.9%)  14.9983 ( 71.0%)  37.4525 ( 36.9%)  37.4537 ( 36.9%)  X86 Assembly Printer
22.9115 ( 28.3%)  14.7169 ( 69.3%)  37.6284 ( 36.8%)  37.6299 ( 36.8%)  X86 Assembly Printer
```

I only ran 2 trials so there is a lot of room for variance, but it seems like everything other than patched+eager_cached is more-or-less comparable when comparing User Time as a percent of compilation time. This makes some sense to me, as I suspect the typical case is to not have multiple fragments anyway, so we would never need to compare the offsets, turning the eager caching into pure overhead.

I think for the sake of simplicity in the code it is not worth having the `mutable` cache at all. If you agree I can post a patch! If this doesn't seem convincing I can do a more rigorous comparison.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D158675/new/

https://reviews.llvm.org/D158675