[PATCH] D80383: Add AST_SIGNATURE record to unhashed control block of PCM files

Wed Jun 3 12:38:30 PDT 2020

dexonsmith added a comment.

In D80383#2068596 <https://reviews.llvm.org/D80383#2068596>, @dang wrote:

> I made the decision to make all relative offsets relative to the start of the AST block rather than their own sub-block or a neighboring block, in order to prevent adding complexity in `ASTReader`. My reasoning that it would be best if the reader didn't have to keep track of a bunch of offsets inside the bitcode at all times, but that it could just remember the offset of the AST block once and always use that for relative offsets. I agree that conceptually making the sub-blocks relocatable too would have been nice, but I figured it wasn't worth the extra complexity in the reader.

I think the extra complexity is small and worth it. The default abbreviation is 7-VBR, which is cheaper for small offsets. I doubt we'll bother to come back and fix this later, so might as well get it right now.

================
Comment at: clang/include/clang/Basic/Module.h:60

+  static Expected<ASTFileSignature> Create(StringRef Bytes) {
+    if (Bytes.size() != 20)
----------------
Can you make this `create`? I think new code we prefer `lowerCamelCase` for method names.

================
Comment at: clang/include/clang/Basic/Module.h:66-72
+    for (unsigned I = 0; I != 5; ++I) {
+      const unsigned BitOff = I * 4;
+      Signature[I] = ((uint32_t)Bytes[BitOff + 0] << 24) |
+                     ((uint32_t)Bytes[BitOff + 1] << 16) |
+                     ((uint32_t)Bytes[BitOff + 2] << 8) |
+                     ((uint32_t)Bytes[BitOff + 3]);
+    }
----------------
This code looks correct, and it's better factored out like this, but I still don't understand why we convert to 32-bit numbers. Have you looked at the code simplification from storing this as an array of 8-bit numbers?

================
Comment at: clang/include/clang/Serialization/ASTBitCodes.h:396-400
       /// Record code for the signature that identifiers this AST file.
       SIGNATURE = 1,

+      /// Record code for the signature of the AST block.
+      AST_SIGNATURE,
----------------
dang wrote:
> dang wrote:
> > dexonsmith wrote:
> > > These names and descriptions hard hard to differentiate. Is there another way of naming these that will be more clear?
> > > 
> > > (One idea I had is to create `CONTROL_BLOCK_HASH` and `AST_BLOCK_HASH` and then `SIGNATURE` could just be their hash-combine, but maybe you have another idea.)
> > I kept the same hasher when computing both of these which mitigates the cost. I don't see the need for also emitting a hash for the control block, there are some optional records that are not in both the AST block and the control block anyway.
> I also think that the `AST_BLOCK_HASH` and the `SIGNATURE` are enough information already. In most cases you can deduce if the control block was different by just checking if the signatures were different and the ASTs the same.
Thanks for updating to `AST_BLOCK_HASH`, I think that addressed the concern I had.

However, having thought further about it, I don't think it makes sense to create a new record type. I suggest just having a convention of writing the hashes out in a particular order.
- First record is the overall signature.
- Second record is the AST block hash.
- In the future, we can add a third record for the control block hash.

In which case, the (single) record name should be something generic like `SIGNATURE` or `SHA1_HASH`.

Note: I doubt there would be additional cost to computing the AST and control block hashes separately, but I agree there isn't a compelling use for the control block hash currently (although it would be nice to confirm properties like if the `CONTROL_BLOCK_HASH` were typically stable when sources changed).

Note: I don't think the content of the "hashed" control block vs. unhashed control block have been carefully considered from first principles. At some point we might do that. I expect there are currently things that change the control block that would not invalidate it for importers.

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:1061

-  return Signature;
+  return std::make_pair(ASTSignature, Signature);
 }
----------------
dexonsmith wrote:
> Alternately, you could change `ASTFileSignature` to have three fields:
> - ControlBlockHash
> - ASTBlockHash
> - Signature
> 
> (You could construct the signature by taking the SHA1 of the two block hashes.)
Have you considered changing `ASTFileSignature` to have multiple fields, one for the signature and another for the AST block hash? Working with pairs and tuples is always awkward.

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:1084
       PP.getHeaderSearchInfo().getHeaderSearchOpts().ModulesHashContent) {
-    Signature = createSignature(StringRef(Buffer.begin(), StartOfUnhashedControl));
+    Record.append(ASTSignature.begin(), ASTSignature.end());
+    Stream.EmitRecord(AST_SIGNATURE, Record);
----------------
dang wrote:
> dang wrote:
> > dexonsmith wrote:
> > > The default abbreviation (7-VBR) isn't great for hashes. Given that we're going to have two of these records, I suggest specifying a custom abbreviation.
> > > 
> > > If you were to split out a prep commit as suggested above to change `SIGNATURE` to be an array of 8-bit values (instead of the current array of 32-bit values), that would be a good opportunity to add an abbreviation.
> > I think encoding the signature as a blob makes the most sense if we make it an array of 20 8-bit values, which is what the hasher gives us. However it is not quite so simple, as it would need a big change in the way direct imports are serialized. The way it is currently done embeds multiple strings and signatures within a single unabbreviated record. I can replace the unabbreviated record with a sub-block that keeps track of a string table and a record for each import containing offsets inside the string table and blobs for the signature.
> But I do think it is outside the scope of this change to do this.
> However it is not quite so simple, as it would need a big change in the way direct imports are serialized. The way it is currently done embeds multiple strings and signatures within a single unabbreviated record.

I don't understand why switching to 8-bit values would affect this. You can just write each 8-bit value as a separate field in the record, just like it currently does for 32-bit values. I.e., the current code seems to work just as well:
```
      for (auto I : M.Signature)
        Record.push_back(I);
```

That said, having a single `IMPORTS` record is kind of ridiculous. It would be nice to clean that up (either as a follow-up or prep commit). I think your suggestion of having an `IMPORTS` sub-block is reasonable. A simpler approach would be just to emit `IMPORT` records. In the reader you can figure out implicitly how many there are (is the next record another `IMPORT`? If so keep reading). If you did this, I would suggest abbreviating the `IMPORT` record since it'll be well-structured.

For `SIGNATURE` record abbreviation, I wasn't think a blob, just something simple like:
```
  auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
  HashAbbrev->Add(BitCodeAbbrevOp(Signature));
  for (int I = 0, E = 5; I != E; ++I)
    HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 32));
```

If you switch to 8-bit:
```
  auto HashAbbrev = std::make_shared<BitCodeAbbrev>();
  HashAbbrev->Add(BitCodeAbbrevOp(Signature));
  for (int I = 0, E = 20; I != E; ++I)
    HashAbbrev->Add(BitCodeAbbrevOp(BitcodeAbbrevOp::Fixed, 8));
```

================
Comment at: clang/lib/Serialization/ASTWriter.cpp:4729-4733
+        StringRef Name = M.Kind == MK_PrebuiltModule ||
+                                 M.Kind == MK_ExplicitModule ||
+                                 M.Kind == MK_ImplicitModule
+                             ? M.ModuleName
+                             : M.FileName;
----------------
dang wrote:
> dexonsmith wrote:
> > Is there ever a reason to use `M.FileName` here, or is that always redundant with what's in the control block? I wonder if we can just remove this complexity.
> > 
> The control block only tracks this information for direct imports although it can maybe be extended to do keep track of this for all dependencies. This bit of the table could then become and index inside the `IMPORTS` record in the control block.
Okay. That might be a nice follow-up or prep commit. If you're not interested in that tangent though I suggest adding an accessor `ModuleFile::isModule` that you can use here.

================
Comment at: clang/lib/Serialization/ASTWriterDecl.cpp:2437
   if (DeclOffsets.size() == Index)
-    DeclOffsets.emplace_back(Loc, Offset);
+    DeclOffsets.emplace_back(Loc, Offset, ASTBlockRange.first);
   else if (DeclOffsets.size() < Index) {
----------------
dang wrote:
> dexonsmith wrote:
> > I suggest making `WriteDecls` relative to the start of the `DECLTYPES_BLOCK_ID`. That will make the block itself (more) self-contained, and also make the offsets smaller (and therefore cheaper to store in bitcode).
> > 
> > I also don't think you need to actually emit that integer, it's just an implicit contract between the reader and writer.
> Not sure what you mean by I don't actually need to emit that integer? If you mean `ASTBlockRange.first` then I don't emit it but rather pass it to the constructor of DeclOffset so that it performs the subtraction there. Since it is a self contained type, it forces the Reader to provide a value for it, making it harder to read the raw relative offset.
Ah! I misread the code.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D80383/new/

https://reviews.llvm.org/D80383