[PATCH] D140493: [SROA] Support promotion in presence of variably-indexed loads

Wed Dec 21 12:41:54 PST 2022

lebedev.ri created this revision.
lebedev.ri added reviewers: arsenm, spatel, efriedma, nikic.
lebedev.ri added a project: LLVM.
Herald added subscribers: arphaman, hiraditya.
Herald added a project: All.
lebedev.ri requested review of this revision.
Herald added a subscriber: wdng.

This originally started out as an InstCombine patch,
but it does not really fit there: https://reviews.llvm.org/D138766

Here, when we encounter a load with variable index,
instead of immediately bailing, we bail only if profitability check says it's not worth it,
and otherwise record it, and rewrite it iff we don't bail for other reasons.
Rewrite is straight-forward - just load whole alloca, and extract the required bit sequence.
Calculating the equivalent bit position is a bit ugly, but any approach i would have chosen
would be called ugly. Currently i just mimic the original address computation stack,
but on a `null` pointer, and then `ptrtoint` it. The alternative is to expand GEP's ourselves.

https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645

First, consider: (example 0)

  #include <cstdlib>
  #include <cstring>
  #include <algorithm>

  void sink(char*);

  constexpr int size = 4;

  void entry(char* input, int length, int position) {
      int max_bytes = length - position;
      int bytes = std::min(size, max_bytes);
      char res[size] = {};
      memcpy(res, input + position, bytes);
      sink(res);
  }

This function has to perform dynamically-sized, but bound, `memcpy`,
which may or may not be good, given particular use case: https://godbolt.org/z/3cd6bvEq5

Now, let's look at another variant (not strictly identical): (example 1)

  #include <cstdlib>
  #include <cstring>
  #include <algorithm>

  void sink(char*);

  constexpr int size = 4;

  void entry(char* input, int length, int position) {
      int last_pos = length - size;
      int clamped_pos = std::min(position, last_pos);
      char tmp[2 * size] = {};
      memcpy(tmp, &input + clamped_pos, size);
      int num_leading_padding_bytes = std::min(size, position - clamped_pos);
      char res[size] = {};
      memcpy(res, tmp + num_leading_padding_bytes, size);
      sink(res);
  }

Here, both memory loads are statically-sized.
Under some external preconditions, that are not relevant here,
the examples are equivalent.

Problem is, the second `memcpy` loads from a non-constant offset into `tmp`,
SROA does not deal with non-constant offsets, so we end up with `tmp`
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6

So while this may or may not already be better than the original variant,
this is still not great. This can come up in hot paths, e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and i was in procees of improving it into (example 1) but got stuck on performance.

The transformation itself isn't that complicated,
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with alive.

Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that for the largest legal int type, we
do have good support for variable-amount shifts. For the type 2x that
width, the legalization will expand the shift into, at worst, 3 legal-sized
shifts + 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
Codegen for true i128 case: https://alive2.llvm.org/ce/z/Tu85qE

I think, this is pretty uncontentious for largest legal integer,
but unfortunately i'm interested in "load i32 from i64" **and** "load i64 from i128" :)
Sliver-lining: in the case i'm looking at, the upper half of the alloca is always zeros,
so after SROA, this becomes: https://alive2.llvm.org/ce/z/FgRHaZ,
and now that i128 codegen is rather good, isn't it?

Thoughts?

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D140493

Files:
  llvm/lib/Transforms/Scalar/SROA.cpp
  llvm/test/Transforms/SROA/widen-load-of-small-alloca.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D140493.484648.patch
Type: text/x-patch
Size: 96689 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20221221/e180c778/attachment-0001.bin>