[clang] [Modules] No transitive source location change (PR #86912)

Chuanqi Xu via cfe-commits cfe-commits at lists.llvm.org
Sun Mar 31 18:54:44 PDT 2024


ChuanqiXu9 wrote:

> > If Clang is configured with 64 bit `SourceLocation`, we can't use 96 bits for serialization. We can at most use 64 bits for a slot. In that case, we can only assume the offset of source location **in its own module** (not the global offset!) is not large than 2^32. I feel this assumption may be valid in a lot of places.
> > Or otherwise we can use less bits for module file index (32 bits seems to be too much honestly), then we can use higher 16 bits to store the module file index, and leave the lower 48 bits to store the source location. In this case, the assumption becomes to "the offset of the source location may not large than 2^48". But it is slightly hard to believe we can reach such extreme cases.
> 
> Let's see if @statham-arm (who introduced the `SourceLocation::[U]IntTy` typedefs) wants to weight in here.
> 
> > I **tried** to encode this as two separate 32bit values. But it will break too many codes. Since a lot of places assume that we can encode the source location as an uint64_t.
> > What I mean is, with VBR6 format (https://llvm.org/docs/BitCodeFormat.html#variable-width-integer), we can save more space for small integers in **on-disk** .pcm files (the memory representation should be the same). For example, for a 64 bits unsigned int `1`, VBR6 can use only 6 bits to store that `000001` to represent the 64 bits value `1` in the on-disk representations. So that even if I don't use more slots to store the module file index, the size of the .pcm files will increase after all.
> 
> Right. My thinking was that single 64bit value with the module file index in the upper 32 bits would basically disable VBR6 encoding for the lower 32 bits. If we split this thing into two separate 32bit values, we are more likely to VBR6 encode both of them. But this would actually increase size for (what I assume is the most common case) local source locations. 

Yes, this is the trade offs.

> Still, I think having a rough idea of how alternative implementations compare would be great.
> 
> Do you have any data on how much recompilation this can save for real world projects?

I don't have a specific data though. But I think it is understandable that the feature is super important. And we can save countless unnecessary compilations after this feature.

After the patch, for reduced BMI, we're already able to avoid changing the BMI if we only change the definitions of non-inline function bodies in the module unit.

Further more, based on this patch, we can do further optimizations. e.g., if we split declaration ID like we did in this patch, we may be able to avoid recompilations if a unreferenced module unit adds or deletes declarations. This may be pretty common in practice:

```
export module interface:part1;
...

//--- 
export module interface:part2;
...

//--- export module interface;
export import :part1;
export import :part2;

//--- consumer.cppm
export module consumer;
import interface;
// only used declarations in interface:part1;
```

Then if the user only changes module unit `interface:part2`, then we can keep the BMI for `consumer.cppm` the same. That implies every user of `consumer.cppm` can avoid recompilations.

---

I see this is a killer feature for C++20 modules. I think it is significantly valuable. So I really want to make this (including later optimizations) into clang19.

https://github.com/llvm/llvm-project/pull/86912


More information about the cfe-commits mailing list