[clang] [clang] Better bitfield access units (PR #65742)

Wed Mar 13 23:43:48 PDT 2024

================
@@ -439,82 +444,194 @@ CGRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
       Members.push_back(MemberInfo(bitsToCharUnits(StartBitOffset),
                                    MemberInfo::Field, nullptr, *Field));
     }
-    return;
+    return Field;
   }
 
-  // Check if OffsetInRecord (the size in bits of the current run) is better
-  // as a single field run. When OffsetInRecord has legal integer width, and
-  // its bitfield offset is naturally aligned, it is better to make the
-  // bitfield a separate storage component so as it can be accessed directly
-  // with lower cost.
-  auto IsBetterAsSingleFieldRun = [&](uint64_t OffsetInRecord,
-                                      uint64_t StartBitOffset) {
-    if (!Types.getCodeGenOpts().FineGrainedBitfieldAccesses)
-      return false;
-    if (OffsetInRecord < 8 || !llvm::isPowerOf2_64(OffsetInRecord) ||
-        !DataLayout.fitsInLegalInteger(OffsetInRecord))
-      return false;
-    // Make sure StartBitOffset is naturally aligned if it is treated as an
-    // IType integer.
-    if (StartBitOffset %
-            Context.toBits(getAlignment(getIntNType(OffsetInRecord))) !=
-        0)
-      return false;
-    return true;
-  };
+  // The SysV ABI can overlap bitfield storage units with both other bitfield
+  // storage units /and/ other non-bitfield data members. Accessing a sequence
+  // of bitfields mustn't interfere with adjacent non-bitfields -- they're
+  // permitted to be accessed in separate threads for instance.
+
+  // We split runs of bit-fields into a sequence of "access units". When we emit
+  // a load or store of a bit-field, we'll load/store the entire containing
+  // access unit. As mentioned, the standard requires that these loads and
+  // stores must not interfere with accesses to other memory locations, and it
+  // defines the bit-field's memory location as the current run of
+  // non-zero-width bit-fields. So an access unit must never overlap with
+  // non-bit-field storage or cross a zero-width bit-field. Otherwise, we're
+  // free to draw the lines as we see fit.
+
+  // Drawing these lines well can be complicated. LLVM generally can't modify a
+  // program to access memory that it didn't before, so using very narrow access
+  // units can prevent the compiler from using optimal access patterns. For
+  // example, suppose a run of bit-fields occupies four bytes in a struct. If we
+  // split that into four 1-byte access units, then a sequence of assignments
+  // that doesn't touch all four bytes may have to be emitted with multiple
+  // 8-bit stores instead of a single 32-bit store. On the other hand, if we use
+  // very wide access units, we may find ourselves emitting accesses to
+  // bit-fields we didn't really need to touch, just because LLVM was unable to
+  // clean up after us.
+
+  // It is desirable to have access units be aligned powers of 2 no larger than
+  // a register. (On non-strict alignment ISAs, the alignment requirement can be
+  // dropped.) A three byte access unit will be accessed using 2-byte and 1-byte
+  // accesses and bit manipulation. If no bitfield straddles across the two
+  // separate accesses, it is better to have separate 2-byte and 1-byte access
+  // units, as then LLVM will not generate unnecessary memory accesses, or bit
+  // manipulation. Similarly, on a strict-alignment architecture, it is better
+  // to keep access-units naturally aligned, to avoid similar bit
+  // manipulation synthesizing larger unaligned accesses.
+
+  // We do this in two phases, processing a sequential run of bitfield
+  // declarations.
+
+  // a) Bitfields that share parts of a single byte are, of necessity, placed in
+  // the same access unit. That unit will encompass a consecutive
+  // run where adjacent bitfields share parts of a byte. (The first bitfield of
+  // such an access unit will start at the beginning of a byte.)
+
+  // b) Accumulate adjacent access units when the combined unit is naturally
+  // sized, no larger than a register, and on a strict alignment ISA,
+  // aligned. Note that this requires lookahead to one or more subsequent access
+  // units. For instance, consider a 2-byte access-unit followed by 2 1-byte
+  // units. We can merge that into a 4-byte access-unit, but we would not want
+  // to merge a 2-byte followed by a single 1-byte (and no available tail
+  // padding).
+
+  // This accumulation is prevented when:
+  // *) it would cross a zero-width bitfield (ABI-dependent), or
+  // *) one of the candidate access units contains a volatile bitfield, or
+  // *) fine-grained bitfield access option is in effect.
+
+  CharUnits RegSize =
+      bitsToCharUnits(Context.getTargetInfo().getRegisterWidth());
+  unsigned CharBits = Context.getCharWidth();
+
+  RecordDecl::field_iterator Begin = FieldEnd;
+  CharUnits StartOffset;
+  uint64_t BitSize;
+  CharUnits BestEndOffset;
+  RecordDecl::field_iterator BestEnd = Begin;
+  bool Volatile;
 
-  // The start field is better as a single field run.
-  bool StartFieldAsSingleRun = false;
   for (;;) {
-    // Check to see if we need to start a new run.
-    if (Run == FieldEnd) {
-      // If we're out of fields, return.
-      if (Field == FieldEnd)
-        break;
-      // Any non-zero-length bitfield can start a new run.
-      if (!Field->isZeroLengthBitField(Context)) {
-        Run = Field;
-        StartBitOffset = getFieldBitOffset(*Field);
-        Tail = StartBitOffset + Field->getBitWidthValue(Context);
-        StartFieldAsSingleRun = IsBetterAsSingleFieldRun(Tail - StartBitOffset,
-                                                         StartBitOffset);
+    CharUnits Limit;
+    bool Barrier;
+    bool Install = false;
+
+    if (Field != FieldEnd && Field->isBitField()) {
+      uint64_t BitOffset = getFieldBitOffset(*Field);
+      if (Begin == FieldEnd) {
+        // Beginning a new access unit.
+        Begin = Field;
+        BestEnd = Begin;
+
+        assert(!(BitOffset % CharBits) && "Not at start of char");
+        StartOffset = bitsToCharUnits(BitOffset);
+        BitSize = 0;
+        Volatile = false;
+      } else if (BitOffset % CharBits) {
+        // Bitfield occupies the same char as previous.
+        assert(BitOffset == Context.toBits(StartOffset) + BitSize &&
+               "Concatenating non-contiguous bitfields");
+      } else {
+        // Bitfield begins a new access unit.
+        Limit = bitsToCharUnits(BitOffset);
+        Barrier = false;
+        if (Field->isZeroLengthBitField(Context) &&
+            (Context.getTargetInfo().useZeroLengthBitfieldAlignment() ||
+             Context.getTargetInfo().useBitFieldTypeAlignment()))
+          Barrier = true;
+        Install = true;
       }
-      ++Field;
-      continue;
+    } else if (Begin == FieldEnd) {
+      // Completed the bitfields.
+      break;
+    } else {
+      // End of the bitfield span, with active access unit.
+      auto Probe = Field;
+      while (Probe != FieldEnd && Probe->isZeroSize(Context))
+        ++Probe;
+      // We can't necessarily use tail padding in C++ structs, so the NonVirtual
+      // size is what we must use there.
+      Limit = Probe != FieldEnd ? bitsToCharUnits(getFieldBitOffset(*Probe))
+              : RD              ? Layout.getNonVirtualSize()
+                                : Layout.getDataSize();
+      Barrier = true;
+      Install = true;
     }
 
-    // If the start field of a new run is better as a single run, or
-    // if current field (or consecutive fields) is better as a single run, or
-    // if current field has zero width bitfield and either
-    // UseZeroLengthBitfieldAlignment or UseBitFieldTypeAlignment is set to
-    // true, or
-    // if the offset of current field is inconsistent with the offset of
-    // previous field plus its offset,
-    // skip the block below and go ahead to emit the storage.
-    // Otherwise, try to add bitfields to the run.
-    if (!StartFieldAsSingleRun && Field != FieldEnd &&
-        !IsBetterAsSingleFieldRun(Tail - StartBitOffset, StartBitOffset) &&
-        (!Field->isZeroLengthBitField(Context) ||
-         (!Context.getTargetInfo().useZeroLengthBitfieldAlignment() &&
-          !Context.getTargetInfo().useBitFieldTypeAlignment())) &&
-        Tail == getFieldBitOffset(*Field)) {
-      Tail += Field->getBitWidthValue(Context);
-      ++Field;
-      continue;
+    if (Install) {
+      // Found the start of a new access unit. Determine if that completes the
+      // current one, or potentially extends it.
+      Install = false;
+      CharUnits Size = bitsToCharUnits(BitSize + CharBits - 1);
+      if (BestEnd == Begin) {
+        // This is the initial access unit.
+        BestEnd = Field;
+        BestEndOffset = StartOffset + Size;
+        if (!BitSize || Types.getCodeGenOpts().FineGrainedBitfieldAccesses)
+          // A barrier, or we're fine grained.
+          Install = true;
+      } else if (Size > RegSize || Volatile)
+        // Too big to accumulate, or just-seen access unit contains a volatile.
----------------
rjmccall wrote:

The logic here is pretty subtle and could use a comment.  Since `Best` is non-empty but hasn't been emitted yet, it must not contain a `volatile` field.  Therefore, if we've seen a volatile field since `Begin`, it must come after `Best`.  We want to make sure we emit `Best` as a separate access unit, so we choose to install `Best` here rather than below after we've merged the new access unit into `Best`.

https://github.com/llvm/llvm-project/pull/65742