[llvm-commits] PATCH: A new SROA implementation

Duncan Sands baldrick at free.fr
Thu Sep 13 13:32:08 PDT 2012


Hi Chandler,

On 13/09/12 12:15, Chandler Carruth wrote:
> Hello all,
>
> Here is an updated patch. It contains all of the fixes I discussed here, and it
> also contains some significant updates:
>
> - Now has additionally survived intensive testing by Duncan. All crashers found
> with all the dragonegg test suites (including many ada tests etc) are fixed. The
> initial and immediate miscompile issues fixed. It's looking pretty good testing
> wise.
>
> - *Many* test cases added. Lots of bugs fixed. Generally, quite a bit of
> hardening here. Duncan found several soft areas that I then was able to more
> thoroughly exercise.
>
>
> I think this is getting *really* close to being ready for an initial commit. It
> is still behind a default-off flag. It will stay there for a bit. In particular
> I want to get more testing, I want to implement SSAUpdater-logic and tie it into
> the rest of the pass manager and collect final performance results before
> enabling it. But I don't want to keep hacking on this out of tree. =]
>
> Anyways, review greatly appreciated.

Looks great to me; I think you should apply it.  Some small comments:

> --- /dev/null
> +++ b/lib/Transforms/Scalar/SROA.cpp
...
> +namespace {
> +/// \brief Alloca partitioning representation.
> +///
> +/// This class represents a partitioning of an alloca into slices, and
> +/// information about the nature of uses of each slice of the alloca. The goal
> +/// is that this information is sufficient to decide if and how to split the
> +/// alloca apart and replace slices with scalars. It is also intended that this
> +/// structure can capture the relevant information needed both due decide about

due decide -> to decide

> +/// and to enact these transformations.
> +class AllocaPartitioning {
> +public:
> +  /// \brief A common base class for representing a half-open byte range.
> +  struct ByteRange {
> +    /// \brief The beginning offset of the range.
> +    uint64_t BeginOffset;
> +
> +    /// \brief The ending offset, not included in the range.
> +    uint64_t EndOffset;
> +
> +    ByteRange() : BeginOffset(), EndOffset() {}
> +    ByteRange(uint64_t BeginOffset, uint64_t EndOffset)
> +        : BeginOffset(BeginOffset), EndOffset(EndOffset) {}
> +
> +    /// \brief Support for ordering ranges.
> +    ///
> +    /// This provides an ordering over ranges such that start offsets are
> +    /// always increasing, and within equal start offsets, the end offsets are
> +    /// decreasing. Thus the spanning range comes first in in cluster with the

in in -> in a

> +    /// same start position.
...
> +  /// \brief A partition of an alloca.
> +  ///
> +  /// This structure represents a contiguous partition of the alloca. These are
> +  /// formed by examining the uses of the alloca. During formation, they may
> +  /// overlap but once an AllocaPartitioning is built, the Partitions within it
> +  /// are all disjoint.
> +  struct Partition : public ByteRange {
> +    /// \brief Whether this partition is splittable into smaller partitions.
> +    ///
> +    /// We flag partitions as splittable when they are formed entirely due to
> +    /// accesses by trivially split operations such as memset and memcpy.

trivially split -> trivially splittable

> +    ///
> +    /// FIXME: At some point we should consider loads and stores of FCAs to be
> +    /// splittable and eagerly split them into scalar values.
> +    bool IsSplittable;
> +
> +    Partition() : ByteRange(), IsSplittable() {}
> +    Partition(uint64_t BeginOffset, uint64_t EndOffset, bool IsSplittable)
> +        : ByteRange(BeginOffset, EndOffset), IsSplittable(IsSplittable) {}
> +  };
> +
> +  /// \brief A particular use of a partition of the alloca.
> +  ///
> +  /// This structure is used to associate uses of a partition with it. They
> +  /// mark the range of bytes which are referenced by a particular instruction,
> +  /// and includes a handle to the user itself and the pointer value in use.
> +  /// The bounds of these uses are determined by intersecting the bounds of the
> +  /// memory use itself with a particular partition. As a consequence there is
> +  /// intentionally overlap between various usues of the same partition.

usues -> uses

...

> +  /// \brief Allow iterating the dead users for this alloca.
> +  ///
> +  /// These are instructions which will never actually use the alloca as they
> +  /// are outside the allocated range. They are safe to replace with undef and
> +  /// delete.

If you were feeling bold you could replace them with "unreachable" (in this
context a store to an undef address would probably be more appropriate).
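Untested sketch of the store-to-undef variant (DeadUser here is just a stand-in
for one of the instructions in DeadUsers):

  // Make the undefined behaviour explicit: store through an undef pointer
  // right before the dead user, then replace/delete the user as before.
  IRBuilder<> IRB(DeadUser);
  IRB.CreateStore(IRB.getInt8(0),
                  UndefValue::get(Type::getInt8PtrTy(DeadUser->getContext())));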

> +  /// @{
> +  typedef SmallVectorImpl<Instruction *>::const_iterator dead_user_iterator;
> +  dead_user_iterator dead_user_begin() const { return DeadUsers.begin(); }
> +  dead_user_iterator dead_user_end() const { return DeadUsers.end(); }
> +  /// @}
> +
> +  /// \brief Allow iterating the dead operands referring to this alloca.

Maybe "dead expressions" would be clearer.  These are just some addresses
computed from the alloca address, right?

...
> +  /// \brief Compute a common type among the uses of a particular partition.
> +  ///
> +  /// This routines walks all of the uses of a particular partition and tries
> +  /// to find a common type between them. Untyped operations such as memset and
> +  /// memcpy are ignored.
> +  Type *getCommonType(iterator I) const;
> +
> +  void print(raw_ostream &OS, const_iterator I, StringRef Indent = "  ") const;
> +  void printUsers(raw_ostream &OS, const_iterator I,
> +                  StringRef Indent = "  ") const;
> +  void print(raw_ostream &OS) const;
> +  void dump(const_iterator I) const LLVM_ATTRIBUTE_NOINLINE LLVM_ATTRIBUTE_USED;
> +  void dump() const LLVM_ATTRIBUTE_NOINLINE LLVM_ATTRIBUTE_USED;

A patch just went in to wrap "dump" functions in #ifndef NDEBUG (plus some
other condition); maybe you should do the same here.
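I.e. something along these lines (from memory, so double check the exact
condition that patch used):

  #if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP)
    void dump(const_iterator I) const;
    void dump() const;
  #endif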

> +  /// \brief The partitions of the alloca.
> +  ///
> +  /// We store a vector of the partitions over the alloca here. This vector is
> +  /// sorted by increasing begin offset, and then by decreasing end offset. See
> +  /// the Partition inner class for more details. Initially there are overlaps,
> +  /// be during construction we form a disjoint sequence toward the end.

be during -> ?
I couldn't make much sense of this sentence.

...
> +  bool computeConstantGEPOffset(GetElementPtrInst &GEPI, uint64_t &GEPOffset) {

Is this routine really not available from some other part of LLVM?

> +  Value *foldSelectInst(SelectInst &SI) {
> +    // If the condition being selected on is a constant or the same value is
> +    // being selected between, fold the select. Yes this does (rarely) happen
> +    // early on.
> +    if (ConstantInt *CI = dyn_cast<ConstantInt>(SI.getCondition()))
> +      return SI.getOperand(1+CI->isZero());
> +    if (SI.getOperand(1) == SI.getOperand(2)) {
> +      assert(*U == SI.getOperand(1));

Urgh, what is the *U that turns up here?  Seems like breaking the abstraction.

> +      return SI.getOperand(1);
> +    }
> +    return 0;
> +  }
> +};

...

> +  bool handleLoadOrStore(Type *Ty, Instruction &I) {
> +    uint64_t Size = TD.getTypeStoreSize(Ty);
> +
> +    // If this memory access can be shown to *statically* extend outside the
> +    // bounds of of the allocation, it's behavior is undefined, so simply
> +    // ignore it. Note that this is more strict than the generic clamping
> +    // behavior of insertUse. We also try to handle cases which might run the
> +    // risk of overflow.
> +    // FIXME: We should instead consider the pointer to have escaped if this
> +    // function is being instrumented for addressing bugs or race conditions.
> +    if (Offset >= AllocSize || Size > AllocSize || Offset + Size > AllocSize) {

Such accesses could be replaced by "unreachable".

> +  bool visitGetElementPtrInst(GetElementPtrInst &GEPI) {
> +    //unsigned IntPtrWidth = TD->getPointerSizeInBits();
> +    //assert(IntPtrWidth == Offset.getBitWidth());

Commented out code above...

> +    uint64_t GEPOffset;
> +    if (!computeConstantGEPOffset(GEPI, GEPOffset))
> +      return markAsEscaping(GEPI);
> +
> +    enqueueUsers(GEPI, GEPOffset);
> +    return true;
> +  }

...
> +  bool visitMemSetInst(MemSetInst &II) {

It is impossible for the use to be the size or the byte argument, right, because
you only deal with expressions of pointer type?  But if someone adds ptrtoint
support one day then you could be in trouble here, so better to check...
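Something like this at the top of the visitor would do (guessing at the
surrounding code; U being the Use currently visited, as in foldSelectInst
above):

  // Only the destination operand is a memory use of our pointer; if the
  // pointer somehow shows up as the value or the length, treat it as
  // escaping rather than recording a bogus use.
  if (*U != II.getRawDest())
    return markAsEscaping(II);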

> +    ConstantInt *Length = dyn_cast<ConstantInt>(II.getLength());
> +    insertUse(II, Length ? Length->getZExtValue() : AllocSize - Offset, Length);
> +    return true;
> +  }

...
> +  // Disable SRoA for any intrinsics except for lifetime invariants.

What about debug intrinsics?
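It would be a pity to block promotion just because the code was built with -g.
Something like this ought to do (guessing that the parameter is called II):

  if (isa<DbgInfoIntrinsic>(II))
    return true; // Debug intrinsics place no constraints on the alloca.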

...

> +/// \brief Use adder for the alloca partitioning.
> +///
> +/// This class adds the uses of an alloca to all of the partitions which it
> +/// uses.

Maybe this should read: This class adds the uses of an alloca to all of the
partitions which they use.

> +/// For splittable partitions, this can end up doing essentially a linear
> +/// walk of the partitions, but the number of steps remains bounded by the
> +/// total result instruction size:
> +/// - The number of partitions is a result of the number unsplittable
> +///   instructions using the alloca.
> +/// - The number of users of each partition is at worst the total number of
> +///   splittable instructions using the alloca.
> +/// Thus we will produce N * M instructions in the end, where N are the number
> +/// of unsplittable uses and M are the number of splittable. This visitor does
> +/// the exact same number of updates to the partitioning.
> +///
> +/// In the more common case, this visitor will leverage the fact that the
> +/// partition space is pre-sorted, and do a logarithmic search for the
> +/// partition needed, making the total visit a classical ((N + M) * log(N))
> +/// complexity operation.

...

> +  void visitGetElementPtrInst(GetElementPtrInst &GEPI) {
> +    if (GEPI.use_empty())
> +      return markAsDead(GEPI);
> +
> +    //unsigned IntPtrWidth = TD->getPointerSizeInBits();
> +    //assert(IntPtrWidth == Offset.getBitWidth());

Commented out code...

> +    uint64_t GEPOffset;
> +    if (!computeConstantGEPOffset(GEPI, GEPOffset))
> +      llvm_unreachable("Unable to compute constant offset for use");
> +
> +    enqueueUsers(GEPI, GEPOffset);
> +  }

...

> +  /// \brief A set of deleted alloca instructions.
> +  ///
> +  /// These pointers are *no longer valid* as they have been deleted. They are
> +  /// used to remove deleted allocas from the list of promotable allocas.
> +  SmallPtrSet<AllocaInst *, 4> DeletedAllocas;

Does this have to live as a member variable?  It is only used in a very
localized way: it could be declared in runOnFunction and passed to
deleteDeadInstructions to fill in.  What is dangerous about this kind of set
is that you mustn't allocate any new allocas after adding something to the
set (and before clearing the set) since it might be allocated with the address
of the dead alloca, causing confusion and non-determinism.  Keeping the set
highly local makes it obvious that this can't happen.
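Roughly what I have in mind, inside runOnFunction (names and surrounding code
invented just to show the shape):

  SmallPtrSet<AllocaInst *, 4> DeletedAllocas;
  deleteDeadInstructions(DeletedAllocas);
  // Consume and clear the set straight away, before anything can create a
  // new alloca, so a recycled address can never match a stale entry.
  for (SmallPtrSet<AllocaInst *, 4>::iterator DI = DeletedAllocas.begin(),
                                              DE = DeletedAllocas.end();
       DI != DE; ++DI)
    PromotableAllocas.erase(std::remove(PromotableAllocas.begin(),
                                        PromotableAllocas.end(), *DI),
                            PromotableAllocas.end());
  DeletedAllocas.clear();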

> +
> +  /// \brief A collection of alloca instructions we can directly promote.
> +  std::vector<AllocaInst *> PromotableAllocas;

...

> +/// \brief Accumulate the constant offsets in a GEP into a single APInt offset.
> +///
> +/// If the provided GEP is all-constant, the total byte offset formed by the
> +/// GEP is computed and Offset is set to it. If the GEP has any non-constant
> +/// operands, the function returns false and the value of Offset is unmodified.
> +static bool accumulateGEPOffsets(const TargetData &TD, GEPOperator &GEP,
> +                                 APInt &Offset) {

This also sounds like the kind of routine that maybe exists already (or should
exist already).
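I think the inline cost analysis has a private copy of essentially the same
loop, which (modulo the exact names) looks roughly like:

  for (gep_type_iterator GTI = gep_type_begin(GEP), GTE = gep_type_end(GEP);
       GTI != GTE; ++GTI) {
    ConstantInt *OpC = dyn_cast<ConstantInt>(GTI.getOperand());
    if (!OpC)
      return false;
    if (OpC->isZero())
      continue;

    // A struct index selects a field; add its offset from the struct layout.
    if (StructType *STy = dyn_cast<StructType>(*GTI)) {
      const StructLayout *SL = TD.getStructLayout(STy);
      Offset += APInt(Offset.getBitWidth(),
                      SL->getElementOffset(OpC->getZExtValue()));
      continue;
    }

    // Otherwise scale the (signed) index by the element's alloc size.
    APInt ElementSize(Offset.getBitWidth(),
                      TD.getTypeAllocSize(GTI.getIndexedType()));
    Offset += OpC->getValue().sextOrTrunc(Offset.getBitWidth()) * ElementSize;
  }
  return true;

If that's right, it would be nice to hoist one copy into a shared helper rather
than growing a third.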

...
> +/// \brief Recursively compute indices for a natural GEP.
> +///
> +/// This is the recursive step for getNaturalGEPWithOffset that walks down the
> +/// element types adding appropriate indices for the GEP.
> +static Value *getNaturalGEPRecursively(IRBuilder<> &IRB, const TargetData &TD,
> +                                       Value *Ptr, Type *Ty, APInt &Offset,
> +                                       Type *TargetTy,
> +                                       SmallVectorImpl<Value *> &Indices,
> +                                       const Twine &Prefix) {
> +  if (Offset == 0)
> +    return getNaturalGEPWithType(IRB, TD, Ptr, Ty, TargetTy, Indices, Prefix);
> +
> +  // We can't recurse through pointer types.
> +  if (Ty->isPointerTy())
> +    return 0;
> +
> +  if (VectorType *VecTy = dyn_cast<VectorType>(Ty)) {
> +    unsigned ElementSizeInBits = VecTy->getScalarSizeInBits();
> +    if (ElementSizeInBits % 8)
> +      return 0; // GEPs over multiple of 8 size vector elements are invalid.

over multiple -> over non-multiple
You invented a new rule here, maybe add a comment that this whole area is a can
of worms and not properly defined right now.

> +    APInt ElementSize(Offset.getBitWidth(), ElementSizeInBits / 8);
> +    APInt NumSkippedElements = Offset.udiv(ElementSize);
> +    if (NumSkippedElements.ugt(VecTy->getNumElements()))
> +      return 0;
> +    Offset -= NumSkippedElements * ElementSize;
> +    Indices.push_back(IRB.getInt(NumSkippedElements));
> +    return getNaturalGEPRecursively(IRB, TD, Ptr, VecTy->getElementType(),
> +                                    Offset, TargetTy, Indices, Prefix);
> +  }

Above you seem to be assuming that the elements are naturally aligned, i.e.
that their alloc size in bits equals their size in bits.  I'm not sure that is
guaranteed.

...

> +  StructType *STy = dyn_cast<StructType>(Ty);
> +  if (!STy)
> +    return 0;
> +
> +  const StructLayout *SL = TD.getStructLayout(STy);
> +  uint64_t StructOffset = Offset.getZExtValue();
> +  if (StructOffset > SL->getSizeInBytes())

Probably this should be >= here.

> +    return 0;
> +  unsigned Index = SL->getElementContainingOffset(StructOffset);
> +  Offset -= APInt(Offset.getBitWidth(), SL->getElementOffset(Index));
> +  Type *ElementTy = STy->getElementType(Index);
> +  if (Offset.uge(TD.getTypeAllocSize(ElementTy)))
> +    return 0; // The offset points into alignment padding.
> +
> +  Indices.push_back(IRB.getInt32(Index));
> +  return getNaturalGEPRecursively(IRB, TD, Ptr, ElementTy, Offset, TargetTy,
> +                                  Indices, Prefix);
> +}

...

> +
> +/// \brief Get a natural GEP from a base pointer to a particular offset and
> +/// resulting in a particular type.
> +///
> +/// The goal is to produce a "natural" looking GEP that works with the existing
> +/// composite types to arrive at the appropriate offset and element type for
> +/// a pointer. TargetTy is the element type the returned GEP should point-to if
> +/// possible. We recurse by decreasing Offset, adding the appropriate index to
> +/// Indices, and setting Ty to the result subtype.
> +///
> +/// If no natural GEP can be constructed, this function returns a null Value*.

returns a null Value* -> returns null

...

> +/// \brief Test whether the given alloca partition can be promoted to a vector.
> +///
> +/// This is a quick test to check whether we can rewrite a particular alloca
> +/// partition (and its newly formed alloca) into a vector alloca with only
> +/// whole-vector loads and stores such that it could be promoted to a vector
> +/// SSA value. We only can ensure this for a limited set of operations, and we
> +/// don't want to do the rewrites unless we are confident that the result will
> +/// be promotable, so we have an early test here.
> +static bool isVectorPromotionViable(const TargetData &TD,
> +                                    Type *AllocaTy,
> +                                    AllocaPartitioning &P,
> +                                    uint64_t PartitionBeginOffset,
> +                                    uint64_t PartitionEndOffset,
> +                                    AllocaPartitioning::const_use_iterator I,
> +                                    AllocaPartitioning::const_use_iterator E) {
> +  VectorType *Ty = dyn_cast<VectorType>(AllocaTy);
> +  if (!Ty)
> +    return false;
> +
> +  uint64_t VecSize = TD.getTypeSizeInBits(Ty);
> +  uint64_t ElementSize = Ty->getScalarSizeInBits();
> +
> +  // While the definition of LLVM vectors is bitpacked, we don't support sizes
> +  // that aren't byte sized.
> +  if (ElementSize % 8)
> +    return false;

You should probably also bail out if ElementSize is not the same as the alloc
size, since no-one cares about those and they are very easy to get wrong.  In
fact that would make the % 8 test redundant.  I reckon you should replace all
of your % 8 tests everywhere with a utility function that checks that the alloc
size and the bit size coincide.
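Something like this (name entirely up to you):

  /// \brief Test whether a type's bit size and alloc size coincide.
  ///
  /// Types for which this fails have padding or a non-byte-sized bit width,
  /// which the code here doesn't want to deal with.
  static bool isTypeSizeSane(const TargetData &TD, Type *Ty) {
    return TD.getTypeSizeInBits(Ty) == TD.getTypeAllocSizeInBits(Ty);
  }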

...
> +/// \brief Try to find a partition of the aggregate type passed in for a given
> +/// offset and size.
> +///
> +/// This recurses through the aggregate type and tries to compute a subtype
> +/// based on the offset and size. When the offset and size span a sub-section
> +/// of an array, it will even compute a new array type for that sub-section.
> +static Type *getTypePartition(const TargetData &TD, Type *Ty,
> +                              uint64_t Offset, uint64_t Size) {
> +  if (Offset == 0 && TD.getTypeAllocSize(Ty) == Size)
> +    return Ty;
> +
> +  if (SequentialType *SeqTy = dyn_cast<SequentialType>(Ty)) {
> +    // We can't partition pointers...
> +    if (SeqTy->isPointerTy())
> +      return 0;
> +
> +    Type *ElementTy = SeqTy->getElementType();
> +    uint64_t ElementSize = TD.getTypeAllocSize(ElementTy);
> +    uint64_t NumSkippedElements = Offset / ElementSize;
> +    if (ArrayType *ArrTy = dyn_cast<ArrayType>(SeqTy))
> +      if (NumSkippedElements >= ArrTy->getNumElements())
> +        return 0;
> +    if (VectorType *VecTy = dyn_cast<VectorType>(SeqTy))
> +      if (NumSkippedElements >= VecTy->getNumElements())
> +        return 0;
> +    Offset -= NumSkippedElements * ElementSize;
> +
> +    // First check if we need to recurse.
> +    if (Offset > 0 || Size < ElementSize) {
> +      // Bail if the partition ends in a different array element.
> +      if ((Offset + Size) > ElementSize)
> +        return 0;
> +      // Recurse through the element type trying to peel off offset bytes.
> +      return getTypePartition(TD, ElementTy, Offset, Size);
> +    }
> +    assert(Offset == 0);
> +
> +    if (Size == ElementSize)
> +      return ElementTy;
> +    assert(Size > ElementSize);
> +    uint64_t NumElements = Size / ElementSize;
> +    if (NumElements * ElementSize != Size)
> +      return 0;
> +    return ArrayType::get(ElementTy, NumElements);
> +  }
> +
> +  StructType *STy = dyn_cast<StructType>(Ty);
> +  if (!STy)
> +    return 0;
> +
> +  const StructLayout *SL = TD.getStructLayout(STy);
> +  if (Offset > SL->getSizeInBytes())
> +    return 0;
> +  uint64_t EndOffset = Offset + Size;
> +  if (EndOffset > SL->getSizeInBytes())
> +    return 0;
> +
> +  unsigned Index = SL->getElementContainingOffset(Offset);
> +  if (SL->getElementOffset(Index) != Offset)
> +    return 0; // Inside of padding.
> +  Offset -= SL->getElementOffset(Index);
> +
> +  Type *ElementTy = STy->getElementType(Index);
> +  uint64_t ElementSize = TD.getTypeAllocSize(ElementTy);
> +  if (Offset >= ElementSize)
> +    return 0; // The offset points into alignment padding.
> +
> +  // See if any partition must be contained by the element.
> +  if (Offset > 0 || Size < ElementSize) {
> +    if ((Offset + Size) > ElementSize)
> +      return 0;
> +    // Bail if this is a poniter element, we can't recurse through them.
> +    if (ElementTy->isPointerTy())
> +      return 0;

Why handle this special case here, since getTypePartition will take care of it?

> +    return getTypePartition(TD, ElementTy, Offset, Size);
> +  }
> +  assert(Offset == 0);
> +
> +  if (Size == ElementSize)
> +    return ElementTy;
> +
> +  StructType::element_iterator EI = STy->element_begin() + Index,
> +                               EE = STy->element_end();
> +  if (EndOffset < SL->getSizeInBytes()) {
> +    unsigned EndIndex = SL->getElementContainingOffset(EndOffset);

Shouldn't this be EndOffset - 1?  (In which case it should be <= above).

> +    if (Index == EndIndex)
> +      return 0; // Within a single element and its padding.
> +    assert(Index < EndIndex);
> +    assert(Index + EndIndex <= STy->getNumElements());
> +    EE = STy->element_begin() + EndIndex;
> +  }
> +
> +  // Try to build up a sub-structure.
> +  SmallVector<Type *, 4> ElementTys;
> +  do {
> +    ElementTys.push_back(*EI++);
> +  } while (EI != EE);
> +  StructType *SubTy = StructType::get(STy->getContext(), ElementTys,
> +                                      STy->isPacked());
> +  const StructLayout *SubSL = TD.getStructLayout(SubTy);
> +  if (Size == SubSL->getSizeInBytes())
> +    return SubTy;
> +
> +  // FIXME: We could potentially recurse down through the last element in the
> +  // sub-struct to find a natural end point.
> +  return 0;
> +}
> +
> +/// \brief Rewrite an alloca partition's users.
> +///
> +/// This routine drives both of the rewriting goals of the SROA pass. It tries
> +/// to rewrite uses of an alloca partition to be conducive for SSA value
> +/// promotion. If the partition needs a new, more refined alloca, this will
> +/// build that new alloca, preserving as much type information as possible, and
> +/// rewrite the uses of the old alloca to point at the new one and have the
> +/// appropriate new offsets. It also evaluates how successful the rewrite was
> +/// at enabling promotion and if it was successful queues the alloca to be
> +/// promoted.
> +bool SROA::rewriteAllocaPartition(AllocaInst &AI,
> +                                  AllocaPartitioning &P,
> +                                  AllocaPartitioning::iterator PI) {
> +  uint64_t AllocaSize = PI->EndOffset - PI->BeginOffset;
> +  if (P.use_begin(PI) == P.use_end(PI))
> +    return false; // No live uses left of this partition.
> +
> +  // Try to compute a friendly type for this partition of the alloca. This
> +  // won't always succeed, in which case we fall back to a legal integer type
> +  // or an i8 array of an appropriate size.
> +  Type *AllocaTy = 0;
> +  if (Type *PartitionTy = P.getCommonType(PI))
> +    if (TD->getTypeAllocSize(PartitionTy) >= AllocaSize)
> +      AllocaTy = PartitionTy;
> +  if (!AllocaTy)
> +    if (Type *PartitionTy = getTypePartition(*TD, AI.getAllocatedType(),
> +                                             PI->BeginOffset, AllocaSize))
> +      AllocaTy = PartitionTy;
> +  if ((!AllocaTy ||
> +       (AllocaTy->isArrayTy() &&
> +        AllocaTy->getArrayElementType()->isIntegerTy())) &&
> +      TD->isLegalInteger(AllocaSize * 8))
> +    AllocaTy = Type::getIntNTy(*C, AllocaSize * 8);
> +  if (!AllocaTy)
> +    AllocaTy = ArrayType::get(Type::getInt8Ty(*C), AllocaSize);

How about an assert here that the alloc size of AllocaTy is >= AllocaSize?
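Something like:

  assert(TD->getTypeAllocSize(AllocaTy) >= AllocaSize &&
         "Chosen alloca type is smaller than the partition being rewritten");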

> +  // Check for the case where we're going to rewrite to a new alloca of the
> +  // exact same type as the original, and with the same access offsets. In that
> +  // case, re-use the existing alloca, but still run through the rewriter to
> +  // performe phi and select speculation.
> +  AllocaInst *NewAI;
> +  if (AllocaTy == AI.getAllocatedType()) {
> +    assert(PI->BeginOffset == 0 &&
> +           "Non-zero begin offset but same alloca type");
> +    assert(PI == P.begin() && "Begin offset is zero on later partition");
> +    NewAI = &AI;
> +  } else {
> +    // FIXME: The alignment here is overly conservative -- we could in many
> +    // cases get away with much weaker alignment constraints.
> +    NewAI = new AllocaInst(AllocaTy, 0, AI.getAlignment(),
> +                           AI.getName() + ".sroa." + Twine(PI - P.begin()),
> +                           &AI);
> +    ++NumNewAllocas;
> +  }
> +
> +  DEBUG(dbgs() << "Rewriting alloca partition "
> +               << "[" << PI->BeginOffset << "," << PI->EndOffset << ") to: "
> +               << *NewAI << "\n");
> +
> +  AllocaPartitionRewriter Rewriter(*TD, P, PI, *this, AI, *NewAI,
> +                                   PI->BeginOffset, PI->EndOffset);
> +  DEBUG(dbgs() << "  rewriting ");
> +  DEBUG(P.print(dbgs(), PI, ""));
> +  if (Rewriter.visitUsers(P.use_begin(PI), P.use_end(PI))) {
> +    DEBUG(dbgs() << "  and queuing for promotion\n");
> +    PromotableAllocas.push_back(NewAI);
> +  } else if (NewAI != &AI) {
> +    // If we can't promote the alloca, iterate on it to check for new
> +    // refinements exposed by splitting the current alloca. Don't iterate on an
> +    // alloca which didn't actually change and didn't get promoted.
> +    Worklist.insert(NewAI);
> +  }
> +  return true;
> +}

...

Ciao, Duncan.


