[llvm] 78eaff2 - [llvm-exegesis] Loop unrolling for loop snippet repetitor mode
Roman Lebedev via llvm-commits
llvm-commits at lists.llvm.org
Tue May 25 02:09:14 PDT 2021
Author: Roman Lebedev
Date: 2021-05-25T12:08:27+03:00
New Revision: 78eaff2ef8a984859a04f944522280360ee825aa
URL: https://github.com/llvm/llvm-project/commit/78eaff2ef8a984859a04f944522280360ee825aa
DIFF: https://github.com/llvm/llvm-project/commit/78eaff2ef8a984859a04f944522280360ee825aa.diff
LOG: [llvm-exegesis] Loop unrolling for loop snippet repetitor mode
I really needed this, like, factually, yesterday,
when verifying dependency breaking idioms for AMD Zen 3 scheduler model.
Consider the following example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=duplicate
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-4a7e50.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.31025, per_snippet_value: 0.31025 }
error: ''
info: ''
assembled_snippet: C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C5FDEFC0C3
...
```
What does it tell us?
So wait, it can only execute ~3 x86 AVX YMM PXOR zero-idioms per cycle?
That doesn't seem right. That's even less than there are pipes supporting this type of op.
Now, second example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-2418b5.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 1.00011, per_snippet_value: 1.00011 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
Now that's just worse. Due to the looping, the throughput completely plummeted,
and now we can only do a single instruction/cycle!?
That's not great.
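The instructions-per-cycle figures quoted for these runs are the reciprocal of the reported `inverse_throughput` (cycles per instruction). A minimal sketch of that conversion follows; the helper name `instructionsPerCycle` is hypothetical and not part of llvm-exegesis:

```cpp
#include <cassert>

// Hypothetical helper (not an llvm-exegesis API): converts the reported
// inverse throughput (cycles per instruction) into instructions per cycle.
double instructionsPerCycle(double InverseThroughput) {
  assert(InverseThroughput > 0.0 && "inverse throughput must be positive");
  return 1.0 / InverseThroughput;
}
```

Feeding it the two measurements above (0.31025 and 1.00011) gives roughly 3.2 and 1.0 instructions per cycle, respectively.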
And final example:
```
$ ./bin/llvm-exegesis --mode=inverse_throughput --snippets-file=/tmp/snippet.s --num-repetitions=1000000 --repetition-mode=loop --loop-body-size=1000
Check generated assembly with: /usr/bin/objdump -d /tmp/snippet-c402e2.o
---
mode: inverse_throughput
key:
instructions:
- 'VPXORYrr YMM0 YMM0 YMM0'
config: ''
register_initial_values: []
cpu_name: znver3
llvm_triple: x86_64-unknown-linux-gnu
num_repetitions: 1000000
measurements:
- { key: inverse_throughput, value: 0.167087, per_snippet_value: 0.167087 }
error: ''
info: ''
assembled_snippet: 49B80800000000000000C5FDEFC0C5FDEFC04983C0FF75F2C3
...
```
So if we merge the previous two approaches, duplicate this single-instruction snippet 1000x
(loop-body-size / instruction count in snippet), and run a loop with 1000 iterations
over that duplicated/unrolled snippet, the measured throughput goes through the roof,
up to ~5.98 instructions/cycle (1/0.167087), which finally tells us that this idiom is zero-cycle!
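The unroll factor and trip count described above can be sketched in standalone C++. This mirrors the arithmetic the patch adds to `LoopSnippetRepetitor::Repeat`, but the names `divideCeilU64`, `LoopPlan` and `planLoop` are hypothetical, and it uses plain 64-bit integers where the patch uses a 32-bit `APInt` for the loop counter:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for llvm::divideCeil: rounds N / D up.
constexpr std::uint64_t divideCeilU64(std::uint64_t N, std::uint64_t D) {
  return (N + D - 1) / D;
}

// Hypothetical summary of how the loop gets laid out.
struct LoopPlan {
  std::uint64_t UnrollFactor;         // copies of the snippet in the loop body
  std::uint64_t TripCount;            // number of loop iterations
  std::uint64_t ExecutedInstructions; // snippet instructions actually run
};

// Assumed inputs: NumRepetitions is --num-repetitions, LoopBodySize is
// --loop-body-size, SnippetSize is the instruction count of the snippet.
LoopPlan planLoop(std::uint64_t NumRepetitions, std::uint64_t LoopBodySize,
                  std::uint64_t SnippetSize) {
  assert(SnippetSize > 0 && "snippet must contain at least one instruction");
  // Duplicate the snippet until the body has at least LoopBodySize
  // instructions; a LoopBodySize of 0 (the default) means no unrolling.
  const std::uint64_t UnrollFactor =
      LoopBodySize <= SnippetSize ? 1 : divideCeilU64(LoopBodySize, SnippetSize);
  // Iterate often enough to execute at least NumRepetitions instructions.
  const std::uint64_t TripCount =
      divideCeilU64(NumRepetitions, UnrollFactor * SnippetSize);
  return {UnrollFactor, TripCount, UnrollFactor * TripCount * SnippetSize};
}
```

With the flags from the final example and a one-instruction snippet, `planLoop(1000000, 1000, 1)` yields an unroll factor of 1000 and a trip count of 1000. With the values used in the updated unit test, `planLoop(3, 5, 1)` yields an unroll factor of 5 and a single iteration, matching the five NOOPs the test expects in the loop block.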
Reviewed By: courbet
Differential Revision: https://reviews.llvm.org/D102522
Added:
Modified:
llvm/docs/CommandGuide/llvm-exegesis.rst
llvm/tools/llvm-exegesis/lib/BenchmarkResult.h
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp
llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h
llvm/tools/llvm-exegesis/llvm-exegesis.cpp
llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp
Removed:
################################################################################
diff --git a/llvm/docs/CommandGuide/llvm-exegesis.rst b/llvm/docs/CommandGuide/llvm-exegesis.rst
index f32a6f6398a9f..bf2222ed553a9 100644
--- a/llvm/docs/CommandGuide/llvm-exegesis.rst
+++ b/llvm/docs/CommandGuide/llvm-exegesis.rst
@@ -189,7 +189,8 @@ OPTIONS
`latency` mode can be make use of either RDTSC or LBR.
`latency[LBR]` is only available on X86 (at least `Skylake`).
- To run in `latency` mode, a positive value must be specified for `x86-lbr-sample-period` and `--repetition-mode=loop`.
+ To run in `latency` mode, a positive value must be specified
+ for `x86-lbr-sample-period` and `--repetition-mode=loop`.
In `analysis` mode, you also need to specify at least one of the
`-analysis-clusters-output-file=` and `-analysis-inconsistencies-output-file=`.
@@ -202,23 +203,36 @@ OPTIONS
On choosing the "right" sampling period, a small value is preferred, but throttling
could occur if the sampling is too frequent. A prime number should be used to
avoid consistently skipping certain blocks.
-
+
.. option:: -repetition-mode=[duplicate|loop|min]
Specify the repetition mode. `duplicate` will create a large, straight line
- basic block with `num-repetitions` copies of the snippet. `loop` will wrap
- the snippet in a loop which will be run `num-repetitions` times. The `loop`
- mode tends to better hide the effects of the CPU frontend on architectures
+ basic block with `num-repetitions` instructions (repeating the snippet
+ `num-repetitions`/`snippet size` times). `loop` will, optionally, duplicate the
+ snippet until the loop body contains at least `loop-body-size` instructions,
+ and then wrap the result in a loop which will execute `num-repetitions`
+ instructions (thus, again, repeating the snippet
+ `num-repetitions`/`snippet size` times). The `loop` mode, especially with loop
+ unrolling tends to better hide the effects of the CPU frontend on architectures
that cache decoded instructions, but consumes a register for counting
- iterations. If performing an analysis over many opcodes, it may be best
- to instead use the `min` mode, which will run each other mode, and produce
- the minimal measured result.
+ iterations. If performing an analysis over many opcodes, it may be best to
+ instead use the `min` mode, which will run each other mode,
+ and produce the minimal measured result.
.. option:: -num-repetitions=<Number of repetitions>
- Specify the number of repetitions of the asm snippet.
+ Specify the target number of executed instructions. Note that the actual
+ repetition count of the snippet will be `num-repetitions`/`snippet size`.
Higher values lead to more accurate measurements but lengthen the benchmark.
+.. option:: -loop-body-size=<Preferred loop body size>
+
+ Only effective for `-repetition-mode=[loop|min]`.
+ Instead of looping over the snippet directly, first duplicate it so that the
+ loop body contains at least this many instructions. This potentially results
+ in loop body being cached in the CPU Op Cache / Loop Cache, which allows to
+ which may have higher throughput than the CPU decoders.
+
.. option:: -max-configs-per-opcode=<value>
Specify the maximum configurations that can be generated for each opcode.
diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h b/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h
index c883a3409ae31..29db84df592cc 100644
--- a/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h
+++ b/llvm/tools/llvm-exegesis/lib/BenchmarkResult.h
@@ -67,7 +67,7 @@ struct InstructionBenchmark {
const MCInst &keyInstruction() const { return Key.Instructions[0]; }
// The number of instructions inside the repeated snippet. For example, if a
// snippet of 3 instructions is repeated 4 times, this is 12.
- int NumRepetitions = 0;
+ unsigned NumRepetitions = 0;
enum RepetitionModeE { Duplicate, Loop, AggregateMin };
// Note that measurements are per instruction.
std::vector<BenchmarkMeasure> Measurements;
diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
index f3ce184b4e93d..03e7ccc26f463 100644
--- a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
+++ b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.cpp
@@ -133,7 +133,7 @@ class FunctionExecutorImpl : public BenchmarkRunner::FunctionExecutor {
} // namespace
Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
- const BenchmarkCode &BC, unsigned NumRepetitions,
+ const BenchmarkCode &BC, unsigned NumRepetitions, unsigned LoopBodySize,
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const {
InstructionBenchmark InstrBenchmark;
@@ -168,14 +168,16 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
// Assemble at least kMinInstructionsForSnippet instructions by repeating
// the snippet for debug/analysis. This is so that the user clearly
// understands that the inside instructions are repeated.
- constexpr const int kMinInstructionsForSnippet = 16;
+ const int MinInstructionsForSnippet = 4 * Instructions.size();
+ const int LoopBodySizeForSnippet = 2 * Instructions.size();
{
SmallString<0> Buffer;
raw_svector_ostream OS(Buffer);
if (Error E = assembleToStream(
State.getExegesisTarget(), State.createTargetMachine(),
BC.LiveIns, BC.Key.RegisterInitialValues,
- Repetitor->Repeat(Instructions, kMinInstructionsForSnippet),
+ Repetitor->Repeat(Instructions, MinInstructionsForSnippet,
+ LoopBodySizeForSnippet),
OS)) {
return std::move(E);
}
@@ -187,8 +189,8 @@ Expected<InstructionBenchmark> BenchmarkRunner::runConfiguration(
// Assemble NumRepetitions instructions repetitions of the snippet for
// measurements.
- const auto Filler =
- Repetitor->Repeat(Instructions, InstrBenchmark.NumRepetitions);
+ const auto Filler = Repetitor->Repeat(
+ Instructions, InstrBenchmark.NumRepetitions, LoopBodySize);
object::OwningBinary<object::ObjectFile> ObjectFile;
if (DumpObjectToDisk) {
diff --git a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
index cc51b5b8486b0..b66902e6c0d6c 100644
--- a/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
+++ b/llvm/tools/llvm-exegesis/lib/BenchmarkRunner.h
@@ -41,6 +41,7 @@ class BenchmarkRunner {
Expected<InstructionBenchmark>
runConfiguration(const BenchmarkCode &Configuration, unsigned NumRepetitions,
+ unsigned LoopUnrollFactor,
ArrayRef<std::unique_ptr<const SnippetRepetitor>> Repetitors,
bool DumpObjectToDisk) const;
diff --git a/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp b/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp
index c866e972a1c46..1851cb4674339 100644
--- a/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp
+++ b/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.cpp
@@ -11,6 +11,7 @@
#include "SnippetRepetitor.h"
#include "Target.h"
+#include "llvm/ADT/Sequence.h"
#include "llvm/CodeGen/TargetInstrInfo.h"
#include "llvm/CodeGen/TargetSubtargetInfo.h"
@@ -24,8 +25,8 @@ class DuplicateSnippetRepetitor : public SnippetRepetitor {
// Repeats the snippet until there are at least MinInstructions in the
// resulting code.
- FillFunction Repeat(ArrayRef<MCInst> Instructions,
- unsigned MinInstructions) const override {
+ FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
+ unsigned LoopBodySize) const override {
return [Instructions, MinInstructions](FunctionFiller &Filler) {
auto Entry = Filler.getEntry();
if (!Instructions.empty()) {
@@ -53,17 +54,26 @@ class LoopSnippetRepetitor : public SnippetRepetitor {
State.getTargetMachine().getTargetTriple())) {}
// Loop over the snippet ceil(MinInstructions / Instructions.Size()) times.
- FillFunction Repeat(ArrayRef<MCInst> Instructions,
- unsigned MinInstructions) const override {
- return [this, Instructions, MinInstructions](FunctionFiller &Filler) {
+ FillFunction Repeat(ArrayRef<MCInst> Instructions, unsigned MinInstructions,
+ unsigned LoopBodySize) const override {
+ return [this, Instructions, MinInstructions,
+ LoopBodySize](FunctionFiller &Filler) {
const auto &ET = State.getExegesisTarget();
auto Entry = Filler.getEntry();
auto Loop = Filler.addBasicBlock();
auto Exit = Filler.addBasicBlock();
+ const unsigned LoopUnrollFactor =
+ LoopBodySize <= Instructions.size()
+ ? 1
+ : divideCeil(LoopBodySize, Instructions.size());
+ assert(LoopUnrollFactor >= 1 && "Should end up with at least 1 snippet.");
+
// Set loop counter to the right value:
- const APInt LoopCount(32, (MinInstructions + Instructions.size() - 1) /
- Instructions.size());
+ const APInt LoopCount(
+ 32,
+ divideCeil(MinInstructions, LoopUnrollFactor * Instructions.size()));
+ assert(LoopCount.uge(1) && "Trip count should be at least 1.");
for (const MCInst &Inst :
ET.setRegTo(State.getSubtargetInfo(), LoopCounter, LoopCount))
Entry.addInstruction(Inst);
@@ -78,7 +88,10 @@ class LoopSnippetRepetitor : public SnippetRepetitor {
Loop.MBB->addLiveIn(Reg);
for (const auto &LiveIn : Entry.MBB->liveins())
Loop.MBB->addLiveIn(LiveIn);
- Loop.addInstructions(Instructions);
+ for (auto _ : seq(0U, LoopUnrollFactor)) {
+ (void)_;
+ Loop.addInstructions(Instructions);
+ }
ET.decrementLoopCounterAndJump(*Loop.MBB, *Loop.MBB,
State.getInstrInfo());
diff --git a/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h b/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h
index 038dcac1e16db..239fa25408d22 100644
--- a/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h
+++ b/llvm/tools/llvm-exegesis/lib/SnippetRepetitor.h
@@ -39,7 +39,8 @@ class SnippetRepetitor {
// Returns a functor that repeats `Instructions` so that the function executes
// at least `MinInstructions` instructions.
virtual FillFunction Repeat(ArrayRef<MCInst> Instructions,
- unsigned MinInstructions) const = 0;
+ unsigned MinInstructions,
+ unsigned LoopBodySize) const = 0;
explicit SnippetRepetitor(const LLVMState &State) : State(State) {}
diff --git a/llvm/tools/llvm-exegesis/llvm-exegesis.cpp b/llvm/tools/llvm-exegesis/llvm-exegesis.cpp
index dc0619fbfbbc1..220f40481f4da 100644
--- a/llvm/tools/llvm-exegesis/llvm-exegesis.cpp
+++ b/llvm/tools/llvm-exegesis/llvm-exegesis.cpp
@@ -116,6 +116,13 @@ static cl::opt<unsigned>
cl::desc("number of time to repeat the asm snippet"),
cl::cat(BenchmarkOptions), cl::init(10000));
+static cl::opt<unsigned>
+ LoopBodySize("loop-body-size",
+ cl::desc("when repeating the instruction snippet by looping "
+ "over it, duplicate the snippet until the loop body "
+ "contains at least this many instruction"),
+ cl::cat(BenchmarkOptions), cl::init(0));
+
static cl::opt<unsigned> MaxConfigsPerOpcode(
"max-configs-per-opcode",
cl::desc(
@@ -365,7 +372,7 @@ void benchmarkMain() {
for (const BenchmarkCode &Conf : Configurations) {
InstructionBenchmark Result = ExitOnErr(Runner->runConfiguration(
- Conf, NumRepetitions, Repetitors, DumpObjectToDisk));
+ Conf, NumRepetitions, LoopBodySize, Repetitors, DumpObjectToDisk));
ExitOnFileError(BenchmarkFile, Result.writeYaml(State, BenchmarkFile));
}
exegesis::pfm::pfmTerminate();
diff --git a/llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp b/llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp
index dcc9617065b03..b5aca1eb35366 100644
--- a/llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp
+++ b/llvm/unittests/tools/llvm-exegesis/X86/SnippetRepetitorTest.cpp
@@ -42,11 +42,13 @@ class X86SnippetRepetitorTest : public X86TestBase {
const auto Repetitor = SnippetRepetitor::Create(RepetitionMode, State);
const std::vector<MCInst> Instructions = {MCInstBuilder(X86::NOOP)};
FunctionFiller Sink(*MF, {X86::EAX});
- const auto Fill = Repetitor->Repeat(Instructions, kMinInstructions);
+ const auto Fill =
+ Repetitor->Repeat(Instructions, kMinInstructions, kLoopBodySize);
Fill(Sink);
}
static constexpr const unsigned kMinInstructions = 3;
+ static constexpr const unsigned kLoopBodySize = 5;
std::unique_ptr<LLVMTargetMachine> TM;
std::unique_ptr<LLVMContext> Context;
@@ -78,7 +80,9 @@ TEST_F(X86SnippetRepetitorTest, Loop) {
ASSERT_EQ(MF->getNumBlockIDs(), 3u);
const auto &LoopBlock = *MF->getBlockNumbered(1);
EXPECT_THAT(LoopBlock.instrs(),
- ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
+ ElementsAre(HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
+ HasOpcode(X86::NOOP), HasOpcode(X86::NOOP),
+ HasOpcode(X86::NOOP), HasOpcode(X86::ADD64ri8),
HasOpcode(X86::JCC_1)));
EXPECT_THAT(LoopBlock.liveins(),
UnorderedElementsAre(