[PATCH] D67161: [clang,ARM] Initial ACLE intrinsics for MVE.

Fri Oct 4 08:29:55 PDT 2019

simon_tatham marked 20 inline comments as done.
simon_tatham added inline comments.

================
Comment at: clang/include/clang/Basic/DiagnosticSemaKinds.td:8529
   "argument should be a multiple of %0">;
+def err_argument_not_power_of_2 : Error<
+  "argument should be a power of 2">;
----------------
dmgreen wrote:
> Do we have any/should we have some tests for these errors?
By the time we finish implementing all the intrinsics, there will be tests for these errors. The intrinsics that need them aren't in this preliminary commit, though.

================
Comment at: clang/include/clang/Basic/arm_mve_defs.td:266
+// vidupq_wb_u16 -> vidupq_u16.
+def PNT_WB:     PolymorphicNameType<0, "wb">;
+
----------------
dmgreen wrote:
> These are not used yet, right? They are not meant for the vldr gathers, those don't have polymorphic names (if I'm reading this list of intrinsics right). These are for vidup's as the comment says?
I've defined here the complete list of polymorphic name types that will be needed for //all// the MVE intrinsics. I haven't included an example of every single one in this preliminary set, though.

Yes, I think `PNT_WB` in particular is only used for the `vidup` family.

================
Comment at: clang/test/CodeGen/arm-mve-intrinsics/scalar-shifts.c:2
+// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
+// RUN: %clang --target=arm-arm-none-eabi -march=armv8.1m.main+mve.fp -mfloat-abi=hard -O3 -S -emit-llvm -o - %s | FileCheck %s
+// RUN: %clang --target=arm-arm-none-eabi -march=armv8.1m.main+mve.fp -mfloat-abi=hard -DPOLYMORPHIC -O3 -S -emit-llvm -o - %s | FileCheck %s
----------------
dmgreen wrote:
> These tests all run -O3, the entire pass pipeline. I see quite a few tests in the same folder do the same thing, but in this case we will be adding quite a few tests. Random mid-end optimizations may well keep on altering them.
> 
> Is it possible to use -disable-O0-optnone and pipe them into opt -mem2reg? What would that do to the codegen, would it be a lot more verbose than it is now?
> 
> Also could/should they be using clang_cc1?
The immediate problem is that if you use any form of `clang | opt | FileCheck` command line, then `update_cc_test_checks.py` says 'skipping non-FileChecked line', because it doesn't support anything more complicated than a two-stage pipeline consisting of clang and FileCheck.

I've enhanced `update_cc_test_checks` to handle that case, in D68406, and respun these tests accordingly. But if that patch doesn't get approval then I'll have to rethink this again.

(For `vld24.c` I ended up running `opt -sroa -early-cse` to avoid the IR becoming noticeably more verbose. The rest worked fine with just `mem2reg`, though.)

Patching `update_cc_test_checks.py` to support more complex pipelines, it seems to work OK: most codegen changes are trivial ones, such as modifiers not appearing on IR operations (`call` instead of `tail call`, plain `shl` in place of `shl nuw`). Only `vld24` becomes significantly more complicated: for that one file I had to run `opt -mem2reg -sroa -early-cse` instead.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:82
+#if 0
+} // stop emacs from wanting to auto-indent everything to 2 spaces inside here
+#endif
----------------
dmgreen wrote:
> Is this needed still? It seems like something that should be handled in clang-format/emacs directly.
Oh yes, now I look on Stack Overflow there is a way to make emacs stop indenting things inside namespaces. Thanks for the prod :-)

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:1403
+        " *\n"
+        " * Permission is hereby granted, free of charge, to any person "
+        "obtaining a copy\n"
----------------
dmgreen wrote:
> Clang format really made a mess of this, didn't it.
> 
> Is this is old license? Should it be updated to the new one. I imagine that these generated headers might well have been missed in the switchover.
Seems likely! I've updated it to the same license that's at the top of this source file itself.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:132
+    // the pointee type.
+    Pointer,
+  };
----------------
dmgreen wrote:
> The gathers are really a Vector of Pointers, in a way. But in the intrinsics (and IR), they are just treated as a vector of ints, so I presume a new type is not needed? We may (but not yet) want to make use of the llvm masked gathers. We would have to add codegen support for them first though (which I don't think we have plans for in the near term).
Yes, I'm working so far on the assumption that we don't need to represent scatter/gather address vectors as a special vector-of-pointers type.

(If nothing else, it would be a bit strange for the 64-bit versions, where the vector element isn't even the same //size// as a pointer.)

If auto-generation of gather loads during vectorization turns out to need a special representation, then I suppose we'll have to rethink.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:207
+    // pointer, especially if the pointee is also const.
+    if (isa<PointerType>(Pointee)) {
+      if (Const)
----------------
dmgreen wrote:
> Would making this always east const be simpler? Do we generate anything with inner pointer types? Should there be a whitespace before the "const " in Name += "const "?
There shouldn't be a whitespace, because clang-format agrees that the right spacing is `char *const p` and not `char * const p` :-)

But, hmm, now you mention it, it may well be that we never actually need a pointer-to-pointer type in the current state of the ACLE spec. Previous versions did have pointers to pointers, for written-back base addresses in contiguous loads/stores, but those intrinsics were removed in later drafts. So perhaps you're right that this is now unnecessary complexity.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:325
+    return Element->c_name_base() + "x" + utostr(Registers);
+  }
+
----------------
dmgreen wrote:
> How come this doesn't have/need a llvm_name? Are they just all handled manually?
Yes: the multi-vector types like `int32x4x2_t` are only used by the `vld2`/`vst4` family of intrinsics, and those are the ones I implemented using handwritten codegen in C++, precisely //because// they use struct types that are needed by only a tiny set of intrinsics, so it seemed pointless to build a huge amount of extra very general machinery into this tablegen system when it was quicker to just handwrite the results for that handful of intrinsics.

I'll add a comment explaining that.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:382
+// ones are the same across the whole set (so that no variable is needed at
+// all).
+//
----------------
dmgreen wrote:
> Cool.
> 
> Do you know how much this helps? Do the benefits outweigh the costs in the complexity of this code?
> 
> And do you know, once we have added a lot of instruction how long this will take to run (I know, difficult question to answer without doing it first, but is it roughly O(n log n)? And memory will never be excessive? A very unscientific test I just ran made it seem pretty quick, but that was obviously without the number of intrinsics we will have in the end). We should try and ensure that we don't make this a compile time burden for llvm.
I don't know //for sure// whether this merging system is worth it. It's hard to judge without having implemented the full set of intrinsics.

I expect it to be algorithmically //more or less// O(n log n). My inner data-structures nerd (ok, not very inner) isn't very fond of the string-keyed `std::map` that groups together all the related pieces of code, on the grounds that for string-keyed structures tries are asymptotically better than the easy approach of doing O(log n) strcmps, but I think this case is unlikely to run into the pathological case of the latter, so it should be fine.

(If there were a trie available in `include/llvm/ADT`, I'd love to use it, but I didn't find one...)

So I think it shouldn't be excessively slow at Tablegen time; and since its effect is to reduce the amount of C++ that has to be processed in a single function by a //full optimizing compiler// later in the clang build, my instinct is that it's likely to save time overall.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:455
+
+class Value {
+public:
----------------
dmgreen wrote:
> Is it worth giving this a different name to more obviously distinguish it from llvm::Value?
I agree, although it's tricky to think of a good name. I've gone for `Result`.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:723
+    CodeGenParamAllocator DummyParamAlloc;
+    V->gen_code(nulls(), DummyParamAlloc);
+
----------------
dmgreen wrote:
> What is this doing?
Hmm, apparently nothing – if I remove it, tablegen gives identical output. Well spotted. Possibly it was a relic of some earlier idea I had of how to do the two-pass parameter allocation.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:755
+    }
+    std::list<Value::Ptr> used;
+    gen_code_dfs(Code, used, Pass);
----------------
dmgreen wrote:
> Used
> 
> And why a list, out of interest?
Just on the general principle of 'don't demand more capabilities from your data structure than you actually need' – I didn't need indexed random access in this case, and a list is the simplest structure that supports constant-time append and iteration. I can make it a vector if you prefer. (Perhaps the constant factors are better.)

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:1023
+    Value::Ptr Arg = getCodeForDagArg(D, 0, Scope, Param);
+    if (const auto *ST = dyn_cast<ScalarType>(CastType)) {
+      if (!ST->requires_float()) {
----------------
dmgreen wrote:
> Do you envision this may require float const or vector casts in the future?
I certainly don't expect to need float casts, because they're destructive of information (so using one should //never// be the right codegen strategy for any MVE intrinsic), and also because floats never need widening to fit in an Arm register: all the vector instructions that deal with scalar floats (which as far as I can remember is just the lane get/set family) can handle each float in its existing format. (Unlike <32 bit ints, for which it will be very common that the underlying LLVM IR intrinsic finds it easier to deal in register-sized i32, so the clang codegen will need to widen and narrow appropriately.)

Bitcasting between vector types is a possibility, I suppose. We can always add another case in here if we turn out to need it.

================
Comment at: clang/utils/TableGen/MveEmitter.cpp:1595
+      // Is this parameter the same for all intrinsics in the group?
+      bool Constant = true;
+      auto it = kv.second.begin();
----------------
dmgreen wrote:
> Can this use (something like):
> bool Constant = llvm::all_of(kv.second, [&](const auto &OI) { return OI.ParamValues[i] == kv.second[0].ParamValues[i]; } ?
Can't say `kv.second[0]` when `kv.second` is a set rather than a vector, but apart from that, yes, that seems to work fine.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D67161/new/

https://reviews.llvm.org/D67161