[PATCH] D46283: [AArch64] Set vectorizer-maximize-bandwidth as default true
Adhemerval Zanella via Phabricator via llvm-commits
llvm-commits at lists.llvm.org
Mon May 21 14:33:33 PDT 2018
zatrazz added a comment.
For some reason I did not attach the meant comments in this update. This is an update of the previous patch with an extended analysis. I checked a bootstrap build TargetTransformation::shouldMaximizeVectorBandwidth enabled for both armhf (r332595) and powerpc64le (r332840). On armhf I did not see any regression, however on powerpc64le I found an issue related on how current code handles the MaximizeBandwidth option. The testcase 'Transforms/LoopVectorize/PowerPC/pr30990.ll' explicit sets vectorizer-maximize-bandwidth to 0, however the code checks for:
In lib/Transforms/Vectorize/LoopVectorize.cpp:

  unsigned MaxVF = MaxVectorSize;
  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
      (MaximizeBandwidth && !OptForSize)) {
I think a possible fix would be to check whether MaximizeBandwidth has been explicitly disabled (instead of checking only its default value):
diff --git a/lib/Transforms/Vectorize/LoopVectorize.cpp b/lib/Transforms/Vectorize/LoopVectorize.cpp
index a65dc09..40c6583 100644
--- a/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -4968,7 +4968,8 @@ LoopVectorizationCostModel::computeFeasibleMaxVF(bool OptForSize,
   }

   unsigned MaxVF = MaxVectorSize;
-  if (TTI.shouldMaximizeVectorBandwidth(OptForSize) ||
+  if (TTI.shouldMaximizeVectorBandwidth(OptForSize &&
+                                        !MaximizeBandwidth.getNumOccurrences()) ||
       (MaximizeBandwidth && !OptForSize)) {
     // Collect all viable vectorization factors larger than the default MaxVF
     // (i.e. MaxVectorSize).
That said, I do not think it should block this patch (I can send it as a separate patch if required).
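For context, here is a minimal standalone sketch (not part of the patch; only the option name is copied from LoopVectorize.cpp) of what the proposed check relies on: cl::opt::getNumOccurrences() can tell the default value apart from an explicit -vectorizer-maximize-bandwidth=0 on the command line.

  // Build against LLVMSupport. Illustrates default vs. explicitly-set cl::opt.
  #include "llvm/Support/CommandLine.h"
  #include <cstdio>

  static llvm::cl::opt<bool> MaximizeBandwidth(
      "vectorizer-maximize-bandwidth", llvm::cl::init(false),
      llvm::cl::desc("Maximize bandwidth when selecting vectorization factor"));

  int main(int argc, char **argv) {
    llvm::cl::ParseCommandLineOptions(argc, argv);
    // No flag: value is false and occurrences is 0, so the TTI hook may decide.
    // With -vectorizer-maximize-bandwidth=0: value is still false but
    // occurrences is 1, so the user explicitly disabled it and the TTI hook
    // should not override that.
    std::printf("value=%d explicitly-set=%d\n", (int)MaximizeBandwidth,
                MaximizeBandwidth.getNumOccurrences() > 0);
    return 0;
  }

Running it without arguments prints "value=0 explicitly-set=0", while passing -vectorizer-maximize-bandwidth=0 prints "value=0 explicitly-set=1", which is the distinction the fix above keys on.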
Now regarding performance differences: for speccpu2006 I see mixed results. The machine I am testing on shows some variance, and even after trying to minimize OS jitter as much as possible (by pinning the run to a specific node and disabling OS services), in two runs I see:
- RUN 1 (1 iteration)
Benchmark Difference (%) *
400.perlbench 0.73
401.bzip2 0.01
403.gcc 0.53
429.mcf 0.05
445.gobmk 0.99
456.hmmer -0.25
458.sjeng 1.02
462.libquantum 0.04
464.h264ref 0.28
471.omnetpp 0.30
473.astar -0.11
483.xalancbmk 1.92
433.milc 0.03
444.namd -0.38
447.dealII 0.95
450.soplex 0.99
453.povray 1.24
470.lbm -0.88
482.sphinx3 1.43
- RUN 2 (3 iterations, best result taken)
Benchmark Difference (%) *
400.perlbench 0.66
401.bzip2 -2.84
403.gcc 0.09
429.mcf 0.46
445.gobmk 0.03
456.hmmer -1.34
458.sjeng 0.12
462.libquantum 0.06
464.h264ref 0.45
471.omnetpp -0.74
473.astar 1.05
483.xalancbmk -0.57
433.milc -0.04
444.namd 0.14
447.dealII -0.37
450.soplex 0.97
453.povray -0.90
470.lbm -0.88
482.sphinx3 0.28
On speccpu2017 the results are slightly more stable:
Benchmark Difference (%) *
600.perlbench_s 0.41
602.gcc_s 0.96
605.mcf_s -0.87
620.omnetpp_s 1.74
623.xalancbmk_s 1.80
625.x264_s 0.16
631.deepsjeng_s 0.33
641.leela_s 0.38
657.xz_s -0.14
619.lbm_s -0.45
638.imagick_s 0.09
644.nab_s -0.10
It also shows some performance improvements on geekbench5 (it was run on another machine by John Brawn from ARM):
Benchmark Difference (%) *
AES 0.00
Camera 3.51
Canny 3.24
Clang 0.00
Dijkstra 0.00
FaceDetection 0.20
GaussianBlur 12.41
Grayscale 0.42
HDR 0.19
HTML5DOM -0.14
HTML5Parse 2.88
HistogramEqualization 0.24
JPEG 0.18
LLVM 0.23
LZMA 0.00
LensBlur 0.00
Lua -0.57
MemoryBandwidth -0.14
MemoryCopy 0.08
MemoryLatencyPageRandom 0.04
Nbody 0.15
PDFRendering -4.15
Particle 0.00
Raw 0.00
Raytrace -0.24
RigidBody -0.09
SFFT 0.46
SGEMM 0.00
SGEMMWithTaskQueue -0.28
SQLite 0.63
Sobel 0.00
SpeechRecognition 0.10
The only regression seems to be PDFRendering, which I am investigating.
* Difference between r332336 with and without the patch; positive values represent a higher score, indicating an improvement with the patch.
https://reviews.llvm.org/D46283