[PATCH] D138766: [InstCombine] If loading from small alloca, load whole alloca and perform variable extraction

Sun Nov 27 11:18:15 PST 2022

lebedev.ri created this revision.
lebedev.ri added reviewers: spatel, RKSimon, efriedma.
lebedev.ri added a project: LLVM.
Herald added a subscriber: hiraditya.
Herald added a project: All.
lebedev.ri requested review of this revision.

Ok, beforehand, i must admit, this fold is esoteric,
does not really fit into InstCombine,
and we don't have any reasonable place for it.

https://discourse.llvm.org/t/where-could-we-perform-sudo-sroa-with-non-constant-offsets/66645

First, consider: (example 0)

  #include <cstdlib>
  #include <cstring>
  #include <algorithm>

  void sink(char*);

  constexpr int size = 4;

  void entry(char* input, int length, int position) {
      int max_bytes = length - position;
      int bytes = std::min(size, max_bytes);
      char res[size] = {};
      memcpy(res, input + position, bytes);
      sink(res);
  }

This function has to perform dynamically-sized, but bound, `memcpy`,
which may or may not be good, given particular use case: https://godbolt.org/z/3cd6bvEq5

Now, let's look at another variant (not strictly identical): (example 1)

  #include <cstdlib>
  #include <cstring>
  #include <algorithm>

  void sink(char*);

  constexpr int size = 4;

  void entry(char* input, int length, int position) {
      int last_pos = length - size;
      int clamped_pos = std::min(position, last_pos);
      char tmp[2 * size] = {};
      memcpy(tmp, &input + clamped_pos, size);
      int num_leading_padding_bytes = std::min(size, position - clamped_pos);
      char res[size] = {};
      memcpy(res, tmp + num_leading_padding_bytes, size);
      sink(res);
  }

Here, both memory loads are statically-sized.
Under some external preconditions, that are not relevant here,
the examples are equivalent.

Problem is, the second `memcpy` loads from a non-constant offset into `tmp`,
SROA does not deal with non-constant offsets, so we end up with `tmp`
not being promoted into a register: https://godbolt.org/z/ebPrrjaa6

So while this may or may not already be better than the original variant,
this is still not great. This can come up in hot paths, e.g. (example 0) is
https://github.com/darktable-org/rawspeed/blob/6be00ea43b92c876692593436f9edbbf70d4c3d4/src/librawspeed/io/BitStream.h#L145-L173
and i was in procees of improving it into (example 1) but got stuck on performance.

The transformation itself isn't that complicated,
we just don't have a great place for it.
I've added hopefully sufficiently exhaustive test coverage,
and verified it with alive.

Now, huge caveat: indeed, this needs a profitability check.
Profitability reasoning: we expect that for the largest legal int type, we
do have good support for variable-amount shifts. For the type 2x that
width, the legalization will expand the shift into, at worst, 3 legal-sized
shifts + 5 supporting ALU ops. We expect that such an expansion is still
not worse than the original pattern we have matched here.
But for any bit width larger than that, this isn't worth it.
Codegen for i128 case: https://alive2.llvm.org/ce/z/Tu85qE

I think, this is pretty uncontentious for largest legal integer,
but unfortunately i'm interested in "load i32 from i64" **and** "load i64 from i128" :)

Thoughts?

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D138766

Files:
  llvm/lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
  llvm/test/Transforms/InstCombine/widen-load-of-small-alloca.ll

-------------- next part --------------
A non-text attachment was scrubbed...
Name: D138766.478111.patch
Type: text/x-patch
Size: 78125 bytes
Desc: not available
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20221127/cc90ca4a/attachment.bin>