[clang] [clang] Better bitfield access units (PR #65742)

via cfe-commits cfe-commits at lists.llvm.org
Sun Dec 10 10:16:59 PST 2023


llvmbot wrote:


@llvm/pr-subscribers-backend-webassembly

@llvm/pr-subscribers-clang

Author: Nathan Sidwell (urnathan)

<details>
<summary>Changes</summary>

Reimplement the SysV (non-Microsoft) bitfield access unit computation. For the avoidance of doubt, this is not an ABI change; it only affects the memory accesses Clang emits to read and write bitfields.

This is a stack of two commits, to make it easier to see the change in behaviour.  The first commit is a set of new tests that verify the placement of access units under the current algorithm.  The second commit is the change, and alters those tests as required. The guts of the change are in `clang/lib/CodeGen/CGRecordLayoutBuilder.cpp`.

The SysV ABI describes placing bitfields in 'storage units', which may end up overlapping each other or other data members (for instance 'int bf :4;' may be in a 32-bit storage unit that overlaps a plain 'char c;' field). LLVM doesn't represent structures at that level, and Clang must create a non-overlapping representation of the structure. That's what's happening here. To avoid confusion, I'm using the phrase 'access unit' to refer to the integer type Clang uses.
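
As a minimal illustration (a hypothetical example, not code from the patch):
```
// Under the SysV ABI, 'bf' notionally lives in a 32-bit storage unit starting
// at byte 0, which overlaps the plain member 'c' at byte 1. Clang's LLVM-level
// layout cannot express that overlap, so it instead gives 'bf' its own
// non-overlapping access unit (here a single byte), followed by 'c'.
struct A {
  int bf : 4; // ABI storage unit: bytes 0-3; Clang access unit: byte 0
  char c;     // byte 1, inside bf's ABI storage unit
};
```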

There are two aspects to choosing access units:

1) bitfields that share bits of a single byte must be in the same access unit -- you can't touch only one of those bitfields. This computation is unchanged.

2) adjacent bitfields that occupy distinct bytes might want to share an access unit, if that makes for a better access size. This computation is changed (see the sketch after this list).
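
As a hypothetical illustration of the two cases (not from the patch):
```
struct Cases {
  // Aspect 1: 'a' and 'b' share bits of byte 0, so they must be in the same
  // access unit -- neither can be loaded or stored without touching the other.
  unsigned a : 3;
  unsigned b : 5;
  // Aspect 2: 'c' occupies byte 1 only. It may be merged into the same access
  // unit as 'a' and 'b' if that yields a better access size, but it does not
  // have to be.
  unsigned c : 6;
};
```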

The current algorithm computes both of these in a single pass: the next bitfield extends the current access unit only if it exactly abuts the previous one and satisfies a few other constraints. But there are some issues with this:

a) If there's even a single bit of padding between the two, no concatenation occurs. This seems suboptimal.

b) Because of the above-mentioned overlapping, bitfield access units might not start on an aligned boundary, and some architectures do not support unaligned accesses.

c) Likewise, the access unit might not be able to be a natural 2^n size, with the adjacent bytes occupied either by other data members or, in C++, by a derived class placed in the tail padding. We must then emit exactly a (say) 24-bit access (presumably to avoid aliasing issues) (example below).

d) Concatenating access units can lead to a larger, unaligned access, and again that's problematic for strict-alignment ISAs. (Consider `struct C { char a : 8; char b : 8; };`: `C` has alignment 1, but a 2-byte access unit would naturally want 2-byte alignment; see the sketch below.)
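
And a hypothetical sketch of the alignment hazard in (d):
```
// C has alignment 1 (its members are char bitfields), so an object of type C
// may start at any byte address. Concatenating 'a' and 'b' into one 2-byte
// access unit therefore implies a potentially misaligned 16-bit access,
// which a strict-alignment ISA cannot perform directly.
struct C {
  char a : 8;
  char b : 8;
};

struct Wrapper {
  char pad;   // C has alignment 1, so 'c' lands at offset 1
  struct C c; // a 2-byte access unit here would be misaligned
};
```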

My understanding is that larger access units were desired, to reduce the total number of accesses, and there was an assertion that the access would be narrowed if only some of the bytes changed.

That's not what we've observed, and it's demonstrable on x86_64:
```
struct B {
  int a : 16;
  int b : 8;
  char c;
};
```
Here, `B::a` and `B::b` could be placed into a 3-byte access unit, but we can't use 4-byte accesses because the 4th byte contains `c`. The code we end up generating for an increment of `B::b` is:
```
        movzbl  2(%rdi), %eax
        shll    $16, %eax
        movzwl  (%rdi), %ecx
        addl    %ecx, %eax
        addl    $65536, %eax                    # imm = 0x10000
        movw    %ax, (%rdi)
        shrl    $16, %eax
        movb    %al, 2(%rdi)
```

We're loading both `B::a` and `B::b`, sticking them together, incrementing `B::b`, and then storing only the latter. I.e. we've reduced the number of memory accesses, but we're doing more work than necessary. Without the `B::c` field there's 1 byte of padding, and a regular 4-byte access can be used (and is). X86_64 is not unique in this behaviour.

I'm of the opinion that one cannot do these in a single pass -- concatenating access units fundamentally requires knowing how large both units are, and you cannot tell that just by looking at the first bitfield of the second access unit.

That's what this patch does: it separates the computation into two phases. We first determine which bitfields /must/ be in a single access unit, and then determine which of those access units could be concatenated. We pay attention to the storage size of the integral type -- an i24 naturally requires 4 bytes of storage. When generating the access units we note barriers such as `:0` bitfields (on ABIs that care), and the starting location of the next non-bitfield member or the structure's tail padding.

The concatenation only occurs if:

a) The larger access is naturally aligned, or unaligned access is cheap.

b) The larger access's storage size doesn't expand into a forbidden zone.

c) The larger access is not bigger than a register.

d) Neither of the two access units contain a volatile bitfield.

(Note that if an access unit's storage size naturally overlaps a subsequent access unit, we'll merge the two, provided the above conditions hold. A partial overlap can occur with packed structures and is handled too.)

ETA: in essence, it only merges access units when the number of memory accesses to get at the merged unit is less than the number of accesses for the original unmerged set of access units.
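
A rough sketch of the two phases, with hypothetical names and a deliberately simplified merge test (the real logic, which operates on clang's field iterators, `CharUnits` and `TargetInfo`, is in `CGRecordLayoutBuilder.cpp`):
```
#include <vector>

struct BitField {
  unsigned BitOffset; // offset within the record, in bits
  unsigned BitWidth;  // declared width, in bits
  bool Volatile;
};

struct AccessUnit {
  unsigned StartByte; // first byte of the unit
  unsigned EndByte;   // one past the last occupied byte
  bool Volatile;
};

static unsigned roundUpToPowerOfTwo(unsigned N) {
  unsigned P = 1;
  while (P < N)
    P *= 2;
  return P;
}

// Phase 1: bitfields that share bits of a byte land in the same access unit.
std::vector<AccessUnit> buildAccessUnits(const std::vector<BitField> &Fields) {
  std::vector<AccessUnit> Units;
  for (const BitField &BF : Fields) {
    unsigned Start = BF.BitOffset / 8;
    unsigned End = (BF.BitOffset + BF.BitWidth + 7) / 8;
    if (!Units.empty() && Start < Units.back().EndByte) {
      // Shares a byte with the previous unit: must be accessed together.
      Units.back().EndByte = End;
      Units.back().Volatile |= BF.Volatile;
    } else {
      Units.push_back({Start, End, BF.Volatile});
    }
  }
  return Units;
}

// Phase 2: concatenate neighbouring units when the merged storage (rounded up
// to a power of two) is no bigger than a register, does not extend past the
// barrier at LimitByte, is acceptably aligned (or unaligned access is cheap),
// and neither unit contains a volatile bitfield.
std::vector<AccessUnit> mergeAccessUnits(std::vector<AccessUnit> Units,
                                         unsigned LimitByte,
                                         unsigned RegisterBytes,
                                         bool CheapUnaligned) {
  std::vector<AccessUnit> Merged;
  for (const AccessUnit &AU : Units) {
    if (!Merged.empty()) {
      AccessUnit &Prev = Merged.back();
      unsigned Size = roundUpToPowerOfTwo(AU.EndByte - Prev.StartByte);
      bool AlignOK = CheapUnaligned || Prev.StartByte % Size == 0;
      if (!Prev.Volatile && !AU.Volatile && Size <= RegisterBytes &&
          Prev.StartByte + Size <= LimitByte && AlignOK) {
        Prev.EndByte = AU.EndByte;
        continue;
      }
    }
    Merged.push_back(AU);
  }
  return Merged;
}
```
Here `LimitByte` plays the role of the `Limit` barrier that the next paragraph describes being passed into `accumulateBitFields`.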

In order to know the ending barrier location, we adjust the call to accumulateBitFields to pass that in. There's a bit of reworking of accumulateFields to make that happen, and as an added bonus we can skip `[[no_unique_address]]` empty members that would otherwise split a bitfield run. (I suspect that never happens in practice though.)
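
For example (hypothetical):
```
struct Empty {};

struct S {
  int a : 4;
  [[no_unique_address]] Empty e; // zero-size empty member between bitfields
  int b : 4;
};
// Previously 'e' ended the bitfield run after 'a'; now it is skipped, so 'a'
// and 'b' are considered together when forming access units.
```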

An example of the change this makes is that the above x86_64 case now places the two bitfields in separate access units, because i16 and i8 can be accessed more cheaply than an i24. If there's no `B::c` member, they're placed in an i32 access unit, which is effectively the same as now. (I see similar changes for the arm32 target.)
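
Spelled out (hypothetical struct names, showing my expectation of the resulting access units):
```
struct NotMerged {
  int a : 16;
  int b : 8;
  char c; // occupies the 4th byte, so a 4-byte access would also touch 'c';
          // 'a' and 'b' now get separate i16 and i8 access units.
};

struct Merged {
  int a : 16;
  int b : 8;
  // The 4th byte is padding, so 'a' and 'b' share a single i32 access unit,
  // effectively the same layout as before the patch.
};
```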

I added a new Clang TargetInfo hook to specify whether the target has cheap unaligned memory accesses or not. That’s used by the new access unit concatenation code. LLVM has a CodeGen hook to tell you this (`TargetLoweringBase::allowsMisalignedMemoryAccesses`), but that’s not
really accessible from where we are in Clang. AFAICT creating a TargetLoweringBase requires an llvm MachineFunction, and we don’t have one when laying out records. The new hook defaults to false, which is in keeping with it being ‘a 32-bit RISC platform, like PPC or SPARC’. That is at odds with the current access unit algorithm, which behaves as if unaligned access has no cost. I went through the target architectures’ implementations of allowsMisalignedMemoryAccesses as best I could, to override that default. It’s possible I missed cases. For the record, the following targets set the flag to true (possibly under control of specific flags): AArch64, ARM, LoongArch, Mips, PPC, SystemZ, VE, WebAssembly & X86.
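
Conceptually, the merge test only needs one boolean from the target; something like this sketch (the helper and its parameters are hypothetical, not the actual patch code):
```
#include "clang/Basic/TargetInfo.h"
#include <cstdint>

// Strict-alignment targets (the default) only accept a concatenated access
// unit that is naturally aligned; targets that set
// HasCheapUnalignedBitfieldAccess have no such restriction.
bool allowMergedAccess(const clang::TargetInfo &Target,
                       uint64_t StartByteOffset, uint64_t MergedSizeBytes) {
  if (Target.hasCheapUnalignedBitfieldAccess())
    return true;
  return MergedSizeBytes != 0 && StartByteOffset % MergedSizeBytes == 0;
}
```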

Clang already has a target hook for register size, which actually reports the pointer size – a reasonable guess, but incorrect for Loongson64, for which only an ILP32 ABI is supported. But that’s an orthogonal problem. (I didn’t check for similar cases, for instance a 32-bit ABI on a mips64 CPU.)

The performance data obtained from our internal benchmarks for an out-of-tree strict-alignment ILP32 target shows a code size reduction of 0.4% to 0.75% and a performance increase of 0.2% to 1.2%. Some benchmarks showed no change; they do not appear to be bitfield-centric.

The above-mentioned x86_64 case now places the two bitfields in separate, naturally sized access units (i16 and i8) and generates the following assembly:
```
      incb    2(%rdi)
```

One of the ARM bitfield tests is likewise simplified.


---

Patch is 249.74 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/65742.diff


28 Files Affected:

- (modified) clang/include/clang/Basic/TargetInfo.h (+10) 
- (modified) clang/lib/Basic/TargetInfo.cpp (+1) 
- (modified) clang/lib/Basic/Targets/AArch64.cpp (+5) 
- (modified) clang/lib/Basic/Targets/ARM.cpp (+8) 
- (modified) clang/lib/Basic/Targets/LoongArch.h (+1) 
- (modified) clang/lib/Basic/Targets/Mips.h (+2) 
- (modified) clang/lib/Basic/Targets/PPC.h (+1) 
- (modified) clang/lib/Basic/Targets/SystemZ.h (+1) 
- (modified) clang/lib/Basic/Targets/VE.h (+1) 
- (modified) clang/lib/Basic/Targets/WebAssembly.h (+1) 
- (modified) clang/lib/Basic/Targets/X86.h (+1) 
- (modified) clang/lib/CodeGen/CGRecordLayoutBuilder.cpp (+250-81) 
- (added) clang/test/CodeGen/aapcs-bitfield-access-unit.c (+266) 
- (modified) clang/test/CodeGen/aapcs-bitfield.c (+568-1238) 
- (modified) clang/test/CodeGen/arm-bitfield-alignment.c (+19-4) 
- (modified) clang/test/CodeGen/arm64-be-bitfield.c (+17-3) 
- (added) clang/test/CodeGen/bitfield-access-unit.c (+114) 
- (modified) clang/test/CodeGen/debug-info-bitfield-0-struct.c (+4-2) 
- (modified) clang/test/CodeGen/no-bitfield-type-align.c (+9-1) 
- (modified) clang/test/CodeGen/struct-x86-darwin.c (+57-12) 
- (added) clang/test/CodeGenCXX/bitfield-access-empty.cpp (+104) 
- (added) clang/test/CodeGenCXX/bitfield-access-tail.cpp (+95) 
- (added) clang/test/CodeGenCXX/bitfield-ir.cpp (+101) 
- (modified) clang/test/CodeGenCXX/bitfield.cpp (+97-4) 
- (modified) clang/test/OpenMP/atomic_capture_codegen.cpp (+4-4) 
- (modified) clang/test/OpenMP/atomic_read_codegen.c (+2-2) 
- (modified) clang/test/OpenMP/atomic_update_codegen.cpp (+4-4) 
- (modified) clang/test/OpenMP/atomic_write_codegen.c (+4-4) 


``````````diff
diff --git a/clang/include/clang/Basic/TargetInfo.h b/clang/include/clang/Basic/TargetInfo.h
index ec0189627dfbd2..de995704540020 100644
--- a/clang/include/clang/Basic/TargetInfo.h
+++ b/clang/include/clang/Basic/TargetInfo.h
@@ -266,6 +266,9 @@ class TargetInfo : public TransferrableTargetInfo,
   LLVM_PREFERRED_TYPE(bool)
   unsigned AllowAMDGPUUnsafeFPAtomics : 1;
 
+  LLVM_PREFERRED_TYPE(bool)
+  unsigned HasCheapUnalignedBitfieldAccess : 1;
+
   unsigned ARMCDECoprocMask : 8;
 
   unsigned MaxOpenCLWorkGroupSize;
@@ -856,6 +859,13 @@ class TargetInfo : public TransferrableTargetInfo,
     return PointerWidth;
   }
 
+  /// Return true iff unaligned accesses are cheap. This affects placement and
+  /// accumulation of bitfield access units. (Not the ABI-mandated placement of
+  /// the bitfields themselves.)
+  bool hasCheapUnalignedBitfieldAccess() const {
+    return HasCheapUnalignedBitfieldAccess;
+  }
+
   /// \brief Returns the default value of the __USER_LABEL_PREFIX__ macro,
   /// which is the prefix given to user symbols by default.
   ///
diff --git a/clang/lib/Basic/TargetInfo.cpp b/clang/lib/Basic/TargetInfo.cpp
index 6cd5d618a4acaa..8b532731245091 100644
--- a/clang/lib/Basic/TargetInfo.cpp
+++ b/clang/lib/Basic/TargetInfo.cpp
@@ -157,6 +157,7 @@ TargetInfo::TargetInfo(const llvm::Triple &T) : Triple(T) {
   HasAArch64SVETypes = false;
   HasRISCVVTypes = false;
   AllowAMDGPUUnsafeFPAtomics = false;
+  HasCheapUnalignedBitfieldAccess = false;
   ARMCDECoprocMask = 0;
 
   // Default to no types using fpret.
diff --git a/clang/lib/Basic/Targets/AArch64.cpp b/clang/lib/Basic/Targets/AArch64.cpp
index c31f2e0bee5439..328c36c826f86a 100644
--- a/clang/lib/Basic/Targets/AArch64.cpp
+++ b/clang/lib/Basic/Targets/AArch64.cpp
@@ -186,6 +186,8 @@ AArch64TargetInfo::AArch64TargetInfo(const llvm::Triple &Triple,
   assert(UseBitFieldTypeAlignment && "bitfields affect type alignment");
   UseZeroLengthBitfieldAlignment = true;
 
+  HasCheapUnalignedBitfieldAccess = true;
+
   // AArch64 targets default to using the ARM C++ ABI.
   TheCXXABI.set(TargetCXXABI::GenericAArch64);
 
@@ -988,6 +990,9 @@ bool AArch64TargetInfo::handleTargetFeatures(std::vector<std::string> &Features,
   if (HasNoSVE)
     FPU &= ~SveMode;
 
+  if (!HasUnaligned)
+    HasCheapUnalignedBitfieldAccess = false;
+
   return true;
 }
 
diff --git a/clang/lib/Basic/Targets/ARM.cpp b/clang/lib/Basic/Targets/ARM.cpp
index ce7e4d4639ceac..b6a984b083ac9f 100644
--- a/clang/lib/Basic/Targets/ARM.cpp
+++ b/clang/lib/Basic/Targets/ARM.cpp
@@ -133,6 +133,9 @@ void ARMTargetInfo::setArchInfo(llvm::ARM::ArchKind Kind) {
   // cache CPU related strings
   CPUAttr = getCPUAttr();
   CPUProfile = getCPUProfile();
+
+  if (ArchVersion < 7)
+    HasCheapUnalignedBitfieldAccess = false;
 }
 
 void ARMTargetInfo::setAtomic() {
@@ -350,6 +353,8 @@ ARMTargetInfo::ARMTargetInfo(const llvm::Triple &Triple,
   // zero length bitfield.
   UseZeroLengthBitfieldAlignment = true;
 
+  HasCheapUnalignedBitfieldAccess = true;
+
   if (Triple.getOS() == llvm::Triple::Linux ||
       Triple.getOS() == llvm::Triple::UnknownOS)
     this->MCountName = Opts.EABIVersion == llvm::EABI::GNU
@@ -636,6 +641,9 @@ bool ARMTargetInfo::handleTargetFeatures(std::vector<std::string> &Features,
   else if (FPMath == FP_VFP)
     Features.push_back("-neonfp");
 
+  if (!Unaligned)
+    HasCheapUnalignedBitfieldAccess = false;
+
   return true;
 }
 
diff --git a/clang/lib/Basic/Targets/LoongArch.h b/clang/lib/Basic/Targets/LoongArch.h
index 3313102492cb8d..105176ea5cd0c1 100644
--- a/clang/lib/Basic/Targets/LoongArch.h
+++ b/clang/lib/Basic/Targets/LoongArch.h
@@ -132,6 +132,7 @@ class LLVM_LIBRARY_VISIBILITY LoongArch64TargetInfo
       : LoongArchTargetInfo(Triple, Opts) {
     LongWidth = LongAlign = PointerWidth = PointerAlign = 64;
     IntMaxType = Int64Type = SignedLong;
+    HasCheapUnalignedBitfieldAccess = true;
     resetDataLayout("e-m:e-p:64:64-i64:64-i128:128-n64-S128");
     // TODO: select appropriate ABI.
     setABI("lp64d");
diff --git a/clang/lib/Basic/Targets/Mips.h b/clang/lib/Basic/Targets/Mips.h
index f46b95abfd75c7..62f774e8743534 100644
--- a/clang/lib/Basic/Targets/Mips.h
+++ b/clang/lib/Basic/Targets/Mips.h
@@ -326,6 +326,8 @@ class LLVM_LIBRARY_VISIBILITY MipsTargetInfo : public TargetInfo {
         IsMips16 = true;
       else if (Feature == "+micromips")
         IsMicromips = true;
+      else if (Feature == "+mips32r6" || Feature == "+mips64r6")
+        HasCheapUnalignedBitfieldAccess = true;
       else if (Feature == "+dsp")
         DspRev = std::max(DspRev, DSP1);
       else if (Feature == "+dspr2")
diff --git a/clang/lib/Basic/Targets/PPC.h b/clang/lib/Basic/Targets/PPC.h
index 4d62673ba7fb8c..d2f4a32796dfe9 100644
--- a/clang/lib/Basic/Targets/PPC.h
+++ b/clang/lib/Basic/Targets/PPC.h
@@ -92,6 +92,7 @@ class LLVM_LIBRARY_VISIBILITY PPCTargetInfo : public TargetInfo {
     LongDoubleFormat = &llvm::APFloat::PPCDoubleDouble();
     HasStrictFP = true;
     HasIbm128 = true;
+    HasCheapUnalignedBitfieldAccess = true;
   }
 
   // Set the language option for altivec based on our value.
diff --git a/clang/lib/Basic/Targets/SystemZ.h b/clang/lib/Basic/Targets/SystemZ.h
index e4ec338880f210..087f5b93d9a52e 100644
--- a/clang/lib/Basic/Targets/SystemZ.h
+++ b/clang/lib/Basic/Targets/SystemZ.h
@@ -45,6 +45,7 @@ class LLVM_LIBRARY_VISIBILITY SystemZTargetInfo : public TargetInfo {
     LongDoubleFormat = &llvm::APFloat::IEEEquad();
     DefaultAlignForAttributeAligned = 64;
     MinGlobalAlign = 16;
+    HasCheapUnalignedBitfieldAccess = true;
     if (Triple.isOSzOS()) {
       TLSSupported = false;
       // All vector types are default aligned on an 8-byte boundary, even if the
diff --git a/clang/lib/Basic/Targets/VE.h b/clang/lib/Basic/Targets/VE.h
index ea9a092cad8090..94bcf3d6ee0288 100644
--- a/clang/lib/Basic/Targets/VE.h
+++ b/clang/lib/Basic/Targets/VE.h
@@ -40,6 +40,7 @@ class LLVM_LIBRARY_VISIBILITY VETargetInfo : public TargetInfo {
     Int64Type = SignedLong;
     RegParmMax = 8;
     MaxAtomicPromoteWidth = MaxAtomicInlineWidth = 64;
+    HasCheapUnalignedBitfieldAccess = true;
 
     WCharType = UnsignedInt;
     WIntType = UnsignedInt;
diff --git a/clang/lib/Basic/Targets/WebAssembly.h b/clang/lib/Basic/Targets/WebAssembly.h
index 83b1711f9fdf6a..4f4caabc02677b 100644
--- a/clang/lib/Basic/Targets/WebAssembly.h
+++ b/clang/lib/Basic/Targets/WebAssembly.h
@@ -84,6 +84,7 @@ class LLVM_LIBRARY_VISIBILITY WebAssemblyTargetInfo : public TargetInfo {
     SizeType = UnsignedLong;
     PtrDiffType = SignedLong;
     IntPtrType = SignedLong;
+    HasCheapUnalignedBitfieldAccess = true;
   }
 
   StringRef getABI() const override;
diff --git a/clang/lib/Basic/Targets/X86.h b/clang/lib/Basic/Targets/X86.h
index 0ab1c10833db26..16a59c36762301 100644
--- a/clang/lib/Basic/Targets/X86.h
+++ b/clang/lib/Basic/Targets/X86.h
@@ -188,6 +188,7 @@ class LLVM_LIBRARY_VISIBILITY X86TargetInfo : public TargetInfo {
     LongDoubleFormat = &llvm::APFloat::x87DoubleExtended();
     AddrSpaceMap = &X86AddrSpaceMap;
     HasStrictFP = true;
+    HasCheapUnalignedBitfieldAccess = true;
 
     bool IsWinCOFF =
         getTriple().isOSWindows() && getTriple().isOSBinFormatCOFF();
diff --git a/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp b/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp
index cbfa79e10bfefc..447727771d5c50 100644
--- a/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp
+++ b/clang/lib/CodeGen/CGRecordLayoutBuilder.cpp
@@ -185,7 +185,8 @@ struct CGRecordLowering {
   void lowerUnion(bool isNoUniqueAddress);
   void accumulateFields();
   void accumulateBitFields(RecordDecl::field_iterator Field,
-                           RecordDecl::field_iterator FieldEnd);
+                           RecordDecl::field_iterator FieldEnd,
+                           CharUnits Limit);
   void computeVolatileBitfields();
   void accumulateBases();
   void accumulateVPtrs();
@@ -376,33 +377,41 @@ void CGRecordLowering::lowerUnion(bool isNoUniqueAddress) {
 }
 
 void CGRecordLowering::accumulateFields() {
-  for (RecordDecl::field_iterator Field = D->field_begin(),
-                                  FieldEnd = D->field_end();
-    Field != FieldEnd;) {
+  RecordDecl::field_iterator FieldEnd = D->field_end();
+  RecordDecl::field_iterator BitField = FieldEnd;
+  for (RecordDecl::field_iterator Field = D->field_begin(); Field != FieldEnd;
+       ++Field) {
     if (Field->isBitField()) {
-      RecordDecl::field_iterator Start = Field;
-      // Iterate to gather the list of bitfields.
-      for (++Field; Field != FieldEnd && Field->isBitField(); ++Field);
-      accumulateBitFields(Start, Field);
+      if (BitField == FieldEnd)
+        // Start gathering bitfields
+        BitField = Field;
     } else if (!Field->isZeroSize(Context)) {
       // Use base subobject layout for the potentially-overlapping field,
       // as it is done in RecordLayoutBuilder
+      CharUnits Offset = bitsToCharUnits(getFieldBitOffset(*Field));
+      if (BitField != FieldEnd) {
+        // Gather the bitfields, now we know where they cannot be
+        // extended past.
+        accumulateBitFields(BitField, Field, Offset);
+        BitField = FieldEnd;
+      }
       Members.push_back(MemberInfo(
-          bitsToCharUnits(getFieldBitOffset(*Field)), MemberInfo::Field,
+          Offset, MemberInfo::Field,
           Field->isPotentiallyOverlapping()
               ? getStorageType(Field->getType()->getAsCXXRecordDecl())
               : getStorageType(*Field),
           *Field));
-      ++Field;
-    } else {
-      ++Field;
     }
   }
+  if (BitField != FieldEnd)
+    // Size as non-virtual base is what we want here.
+    accumulateBitFields(BitField, FieldEnd,
+                        RD ? Layout.getNonVirtualSize() : Layout.getDataSize());
 }
 
-void
-CGRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
-                                      RecordDecl::field_iterator FieldEnd) {
+void CGRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
+                                           RecordDecl::field_iterator FieldEnd,
+                                           CharUnits Limit) {
   // Run stores the first element of the current run of bitfields.  FieldEnd is
   // used as a special value to note that we don't have a current run.  A
   // bitfield run is a contiguous collection of bitfields that can be stored in
@@ -415,12 +424,16 @@ CGRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
   uint64_t StartBitOffset, Tail = 0;
   if (isDiscreteBitFieldABI()) {
     for (; Field != FieldEnd; ++Field) {
-      uint64_t BitOffset = getFieldBitOffset(*Field);
+      if (!Field->isBitField()) {
+        assert(Field->isZeroSize(Context) && "non-zero sized non-bitfield");
+        continue;
+      }
       // Zero-width bitfields end runs.
       if (Field->isZeroLengthBitField(Context)) {
         Run = FieldEnd;
         continue;
       }
+      uint64_t BitOffset = getFieldBitOffset(*Field);
       llvm::Type *Type =
           Types.ConvertTypeForMem(Field->getType(), /*ForBitField=*/true);
       // If we don't have a run yet, or don't live within the previous run's
@@ -442,79 +455,235 @@ CGRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
     return;
   }
 
-  // Check if OffsetInRecord (the size in bits of the current run) is better
-  // as a single field run. When OffsetInRecord has legal integer width, and
-  // its bitfield offset is naturally aligned, it is better to make the
-  // bitfield a separate storage component so as it can be accessed directly
-  // with lower cost.
-  auto IsBetterAsSingleFieldRun = [&](uint64_t OffsetInRecord,
-                                      uint64_t StartBitOffset) {
-    if (!Types.getCodeGenOpts().FineGrainedBitfieldAccesses)
-      return false;
-    if (OffsetInRecord < 8 || !llvm::isPowerOf2_64(OffsetInRecord) ||
-        !DataLayout.fitsInLegalInteger(OffsetInRecord))
-      return false;
-    // Make sure StartBitOffset is naturally aligned if it is treated as an
-    // IType integer.
-    if (StartBitOffset %
-            Context.toBits(getAlignment(getIntNType(OffsetInRecord))) !=
-        0)
-      return false;
-    return true;
+  // The SysV ABI can overlap bitfield storage units with both other bitfield
+  // storage units /and/ other non-bitfield data members. Such overlap, in the
+  // absence of packing, is always complete -- one storage unit is entirely
+  // within another. However, llvm cannot represent that -- it's structures are
+  // entirely flat. We place bitfields in 'access units', which are similar to
+  // the SysV storage units, but a clang-specific concept.
+
+  // It can be advantageous to concatenate two adjacent access units, if the
+  // concatenation can be read or written in a single instruction.
+
+  // We do two passes.
+
+  // a) allocate bitfields into the smallest access units they can
+  // fit. This results in a set of integral-typed access units.
+
+  // b) concatenate mergeable access units. This applies the
+  // above-mentioned optimization, and in general, requires lookahead
+  // to know the next access unit -- not merely the next bitfield.
+
+  class AccessUnit {
+    // Which bitfields are in the access
+    RecordDecl::field_iterator First;
+    RecordDecl::field_iterator Last;
+
+    CharUnits Start; // Starting offset within the record
+    CharUnits End;   // Finish offset (exclusive) within the record
+
+    bool ContainsVolatile = false;
+
+  public:
+    AccessUnit(RecordDecl::field_iterator F, RecordDecl::field_iterator L,
+               CharUnits S, CharUnits E, bool HasVolatile = false)
+        : First(F), Last(L), Start(S), End(E), ContainsVolatile(HasVolatile) {}
+    AccessUnit(RecordDecl::field_iterator F, CharUnits Place)
+        : AccessUnit(F, F, Place, Place) {}
+
+  public:
+    auto begin() const { return First; }
+    auto end() const { return Last; }
+
+  public:
+    void MergeFrom(AccessUnit const &Earlier) {
+      First = Earlier.First;
+      Start = Earlier.Start;
+    }
+
+  public:
+    // Accessors
+    CharUnits getSize() const { return getSize(*this); }
+    CharUnits getStart() const { return Start; }
+    CharUnits getSize(const AccessUnit &NotEarlier) const {
+      return NotEarlier.End - Start;
+    }
+
+    // Setter
+    void setEnd(CharUnits E) { End = E; }
+
+    // Predicates
+    bool isBarrier() const { return getSize().isZero(); }
+    bool hasVolatile() const { return ContainsVolatile; }
+
+    bool StartsBefore(CharUnits Offset, bool NonStrict = false) const {
+      if (NonStrict)
+        // Not strictly <, permit ==
+        ++Offset;
+      return Start < Offset;
+    }
+    bool ExtendsBeyond(CharUnits Offset) const { return End > Offset; }
+    bool EndsAt(CharUnits Offset) const { return End == Offset; }
   };
+  SmallVector<AccessUnit, 8> AUs;
+  bool SeenVolatile = false;
+  CharUnits StartOffset{};
+  for (; Field != FieldEnd; ++Field) {
+    if (!Field->isBitField()) {
+      assert(Field->isZeroSize(Context) && "non-zero sized non-bitfield");
+      continue;
+    }
 
-  // The start field is better as a single field run.
-  bool StartFieldAsSingleRun = false;
-  for (;;) {
-    // Check to see if we need to start a new run.
-    if (Run == FieldEnd) {
-      // If we're out of fields, return.
-      if (Field == FieldEnd)
-        break;
-      // Any non-zero-length bitfield can start a new run.
-      if (!Field->isZeroLengthBitField(Context)) {
-        Run = Field;
-        StartBitOffset = getFieldBitOffset(*Field);
-        Tail = StartBitOffset + Field->getBitWidthValue(Context);
-        StartFieldAsSingleRun = IsBetterAsSingleFieldRun(Tail - StartBitOffset,
-                                                         StartBitOffset);
+    if (Run != FieldEnd) {
+      // Accumulating.
+      uint64_t BitOffset = getFieldBitOffset(*Field);
+      if (uint64_t(Context.toBits(bitsToCharUnits(BitOffset))) != BitOffset) {
+        // This bitfield's start is not at a char boundary. It must
+        // share an access unit with the previous bitfield.
+        assert(Tail == BitOffset && "Packing non-contiguous bitfield");
+        Tail += Field->getBitWidthValue(Context);
+        if (Field->getType().isVolatileQualified())
+          SeenVolatile = true;
+        continue;
       }
-      ++Field;
-      continue;
+
+      // End run. Move the end to the next char boundary (bTCU truncates).
+      CharUnits EndOffset = bitsToCharUnits(Tail + Context.getCharWidth() - 1);
+      AUs.emplace_back(Run, Field, StartOffset, EndOffset, SeenVolatile);
+      Run = FieldEnd;
+
+      // Fallthrough -- this field will start a new run.
     }
 
-    // If the start field of a new run is better as a single run, or
-    // if current field (or consecutive fields) is better as a single run, or
-    // if current field has zero width bitfield and either
-    // UseZeroLengthBitfieldAlignment or UseBitFieldTypeAlignment is set to
-    // true, or
-    // if the offset of current field is inconsistent with the offset of
-    // previous field plus its offset,
-    // skip the block below and go ahead to emit the storage.
-    // Otherwise, try to add bitfields to the run.
-    if (!StartFieldAsSingleRun && Field != FieldEnd &&
-        !IsBetterAsSingleFieldRun(Tail - StartBitOffset, StartBitOffset) &&
-        (!Field->isZeroLengthBitField(Context) ||
-         (!Context.getTargetInfo().useZeroLengthBitfieldAlignment() &&
-          !Context.getTargetInfo().useBitFieldTypeAlignment())) &&
-        Tail == getFieldBitOffset(*Field)) {
-      Tail += Field->getBitWidthValue(Context);
-      ++Field;
-      continue;
+    assert(Run == FieldEnd);
+    // We're starting a new run.
+    if (!Field->isZeroLengthBitField(Context) ||
+        Context.getTargetInfo().useZeroLengthBitfieldAlignment() ||
+        Context.getTargetInfo().useBitFieldTypeAlignment()) {
+      // A non-zero length field starts a run, or a zero-length one might
+      // create a barrier (ABI-dependent).
+      StartBitOffset = getFieldBitOffset(*Field);
+      StartOffset = bitsToCharUnits(StartBitOffset);
+      assert(uint64_t(Context.toBits(StartOffset)) == StartBitOffset &&
+             "Bitrun doesn't start at a char unit");
+      Tail = StartBitOffset;
+      if (Field->isZeroLengthBitField(Context))
+        // Place a stop marker
+        AUs.emplace_back(Field, StartOffset);
+      else {
+        // Start a run
+        SeenVolatile = Field->getType().isVolatileQualified();
+        Tail += Field->getBitWidthValue(Context);
+        Run = Field;
+      }
     }
+  }
+  if (Run != FieldEnd) {
+    // End run. Move the end to the next char boundary (bTCU truncates).
+    CharUnits EndOffset = bitsToCharUnits(Tail + Context.getCharWidth() - 1);
+    AUs.emplace_back(Run, Field, StartOffset, EndOffset, SeenVolatile);
+  }
 
-    // We've hit a break-point in the run and need to emit a storage field.
-    llvm::Type *Type = getIntNType(Tail - StartBitOffset);
-    // Add the storage member to the record and set the bitfield info for all of
-    // the bitfields in the run.  Bitfields get the offset of their storage but
-    // come afterward and remain there after a stable sort.
-    Members.push_back(StorageInfo(bitsToCharUnits(StartBitOffset), Type));
-    for (; Run != Field; ++Run)
-      Members.push_back(MemberInfo(bitsToCharUnits(StartBitOffset),
-                                   MemberInfo::Field, nullptr, *Run));
-    Run = FieldEnd;
-    StartFieldAsSingleRun = false;
+  if (!Types.getCodeGenOpts().FineGrainedBitfieldAccesses) {
+    // Append a stop marker -- we can extend up to this.
+    AUs.emplace_back(FieldEnd, Limit);
+
+    // Concatenate mergeable units. We can do this by only scanning forwards,
+    // finding a good point to merge and then repeating. There are two cases of
+    // interest.
+
+    // 1. Concatenating the next unit with the current accumulation results in a
+    // type no bigger than a register.
+
+    // 2. The current accumulation's underlying type extends into the next
+    // accumulation, and, iff this requires a bigger type, then it's still no
+    // bigger than a register.
+
+    // If we're still nicely aligned (or unaligned accesses are ok), do the
+    // concatenation. The semantics of volatile bitfields aren't
+   ...
[truncated]

``````````

</details>


https://github.com/llvm/llvm-project/pull/65742


More information about the cfe-commits mailing list