[llvm] [AArch64] Set MaxInterleaving to 4 for Neoverse V2 (PR #100385)

Thu Aug 15 11:05:14 PDT 2024

sjoerdmeijer wrote:

> Thanks - there is a cam_r in there too I think.

Thanks @davemgreen, I have now been able to run that, and I can confirm the 4.5% regression in cam4.
Before, the vector body was processing 8 elements per iteration with an interleaving factor of 2; after interleaving 4x, it processes 16 elements. Because the trip count is low (< 16), the vector body is no longer executed at all and only the epilogue loop runs. To make matters worse, the function is called a lot, so the epilogue is very hot.

I would like to think that these are two exceptions, and that interleaving with a maximum of 4 is the sensible and generally beneficial thing to do; cam4 is the outlier among all the codes that I ran.

The way forward is to follow up on this and look into epilogue vectorisation. At the moment, epilogue vectorisation is effectively disabled because it will kick in only if the vectorization factor is 16 or more:

    static cl::opt<unsigned> EpilogueVectorizationMinVF(
        "epilogue-vectorization-minimum-VF", cl::init(16), cl::Hidden,
        cl::desc("Only loops with vectorization factor equal to or larger than "
                 "the specified value are considered for epilogue vectorization."));

I have verified that lowering this to 8 recovers almost all of the lost performance for cam4.
I am hoping that lowering this to 8 for the Neoverse V2 is not going to be controversial, and that's what I would like to investigate next if we are happy with the interleaving.

WDYT?

https://github.com/llvm/llvm-project/pull/100385