[llvm-dev] Vectorizing remainder loop

Thu Aug 2 15:28:51 PDT 2018

Hi Hameeza,

Aside from Ashutosh's patch.....

When the vector width is that large, we can't keep vectorizing the remainder as shown below. If nothing else, the code size will be huge; taking ITLB misses because of it would be very bad, for example.
	VF=2048 // main vector loop
	VF=1024 // vectorized remainder 1
	VF=512   // vectorized remainder 2
	...
	Vectorize remainder until trip count is small enough for scalar execution.
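
To make the code-size concern concrete, here is a minimal sketch in C that counts how many copies of the loop body such a cascade would emit. The helper name cascade_loop_copies and the min_vf cutoff are hypothetical, introduced only for illustration; this is not how the Loop Vectorizer computes anything.

```c
#include <assert.h>

/* Hypothetical helper (not an LLVM API): number of copies of the loop body
   emitted when VF is halved from max_vf down to min_vf, followed by one
   scalar tail loop. */
static unsigned cascade_loop_copies(unsigned max_vf, unsigned min_vf) {
    assert(min_vf >= 2 && max_vf >= min_vf);
    unsigned copies = 1;                 /* the scalar tail loop */
    for (unsigned vf = max_vf; vf >= min_vf; vf /= 2)
        copies++;                        /* one vector loop per VF */
    return copies;
}
```

Under these assumptions, cascade_loop_copies(2048, 4) yields 11: ten vector loops (VF=2048 down to VF=4) plus the scalar tail, each carrying its own copy of the body.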

Direction #1
Does your HW support efficient masking? If so, the first thing to try is VF=2048 with masking so that you won't have any remainder loop. In other words, bump up the trip count to a multiple of 2048 and then have an IF branch inside the loop body so that everything beyond the original trip count is a no-op. Then vectorize that loop.

	for (i=0;i<N;i++){
		body
	}
==>
	for (i=0;i<M;i++){ // where M is N rounded up to a multiple of 2048
		if (i < N) {
			body
		}
	}
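
A scalar C emulation of the transformed loop may help show what the masked vector loop computes. This is only a sketch under stated assumptions: VF is shrunk to 8 as a stand-in for 2048, the inner lane loop stands in for one vector instruction, the "add 1" body is a placeholder, and add_one_masked is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum { VF = 8 };  /* assumption: small stand-in for VF=2048 */

/* Emulates "round the trip count up to a multiple of VF and mask off the
   lanes beyond the original trip count". Placeholder body: a[i] += 1. */
static void add_one_masked(int *a, size_t n) {
    assert(a != NULL || n == 0);
    size_t m = (n + VF - 1) / VF * VF;           /* M: n rounded up to a multiple of VF */
    for (size_t i = 0; i < m; i += VF) {         /* one iteration = one vector op */
        for (size_t lane = 0; lane < VF; lane++) {
            if (i + lane < n)                    /* mask bit for this lane */
                a[i + lane] += 1;                /* inactive lanes are no-ops */
        }
    }
}
```

Only the last outer iteration has inactive lanes, so there is a single loop body and no separate remainder loop.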

If your HW can't execute the vector version of the above loop efficiently enough, it's already busted. Typically, when VF is that large, what you'll get in the remainder is masked vector code like below, and vec_remainder_body is reasonably hot, as you say in your original mail. As such, remainder loop vectorization isn't a solution for that problem.

	for (i=0;i+2048<=N;i+=2048){ // full vectors only
		vec_body
	}
	for (;i<M;i+=1024){ // where M is N rounded up to a multiple of 1024
		if (i < N) {
			vec_remainder_body
		}
	}
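
The same shape can be written as a scalar C emulation: an unmasked main loop followed by a masked remainder loop at half the width. It is a sketch only; the widths 8 and 4 stand in for 2048 and 1024, the "multiply by 2" body is a placeholder, and scale_by_two is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

enum { VF_MAIN = 8, VF_REM = 4 };  /* assumption: stand-ins for 2048 and 1024 */

/* Unmasked main vector loop, then a masked remainder loop at VF_REM. */
static void scale_by_two(int *a, size_t n) {
    assert(a != NULL || n == 0);
    size_t i = 0;
    for (; i + VF_MAIN <= n; i += VF_MAIN)       /* full vectors only */
        for (size_t lane = 0; lane < VF_MAIN; lane++)
            a[i + lane] *= 2;
    for (; i < n; i += VF_REM)                   /* runs at most twice here */
        for (size_t lane = 0; lane < VF_REM; lane++)
            if (i + lane < n)                    /* mask off lanes past n */
                a[i + lane] *= 2;
}
```

Even in this form the remainder body needs masking, which is why efficient masked execution matters regardless of how the remainder is split.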

If your HW designers insist that the compiler generate
	VF=2048 // main vector loop
	VF=1024 // vectorized remainder 1
	VF=512  // vectorized remainder 2
	...
	Remainder is small enough for scalar.
I suggest you go back and tell them to reconsider the HW design so that Direction #1 works well enough on the HW.

Direction #2
In the meantime, if you are really stuck in that situation (i.e., the HW is already built and you don't have much time), the simplest thing for you to do is to run the LV a second (third/fourth/...) time, after marking the remainder loop with metadata so that you know which loops you want to deal with in the second round. It's very much a hack, but it'll be a small change on your side, and that way you are not impacted much by other changes the VPlan project is making. If you carry a major change outside of trunk, you may be hit hard.

Direction #3
If you are given time to do the right implementation of remainder loop vectorization, please join the VPlan bandwagon and work on it there. Major development like this should happen on VPlan. Please let us know if you can do that. Ashutosh, how about you?

Hopefully, one or more of the four alternative directions to consider, including Ashutosh's patch, would work for you.

Thanks,
Hideki
-------------------
Date: Mon, 30 Jul 2018 05:16:15 +0000
From: "Nema, Ashutosh via llvm-dev" <llvm-dev at lists.llvm.org>
To: hameeza ahmed <hahmed2305 at gmail.com>, Craig Topper
	<craig.topper at gmail.com>, Hal Finkel <hfinkel at anl.gov>, "Friedman,
	Eli" <efriedma at codeaurora.org>
Cc: llvm-dev <llvm-dev at lists.llvm.org>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Hi Hameeza,

At this point the Loop Vectorizer does not have the capability to vectorize the epilog/remainder loop.
Some time back there was an RFC on epilog loop vectorization, but it did not go through because of concerns.
That RFC has a patch as well; maybe you can give it a try.
http://llvm.1065342.n5.nabble.com/llvm-dev-Proposal-RFC-Epilog-loop-vectorization-tt106322.html#none

- Ashutosh

From: llvm-dev <llvm-dev-bounces at lists.llvm.org> On Behalf Of hameeza ahmed via llvm-dev
Sent: Sunday, July 29, 2018 10:24 PM
To: llvm-dev <llvm-dev at lists.llvm.org>; Craig Topper <craig.topper at gmail.com>; Hal Finkel <hfinkel at anl.gov>; Friedman, Eli <efriedma at codeaurora.org>
Subject: Re: [llvm-dev] Vectorizing remainder loop

Please help in solving this issue. The scalar remainder loop is a really big and significant problem with large vector widths.

Please help

Thank You

On Sun, Jul 29, 2018 at 2:52 PM, hameeza ahmed <hahmed2305 at gmail.com> wrote:
Hello, I'm working on hardware with very large vector widths, up to v2048. When I vectorize using the default LLVM vectorizer, up to 2047 iterations end up in the scalar remainder loop. These are not vectorized by LLVM, which increases the cost. However, they should be vectorized using the next available vector widths, i.e. v1024, v512, v256, v128, v64, v32, v16, v8, v4.....

The scalar remainder loop issue has always been there in LLVM, but it is amplified and can't be ignored with large vector widths. It is very important to solve this issue.

Please help. I'm looking at LoopVectorizer.cpp but am unable to figure out which code to change.

It's very important for me to solve this issue.

Please help.

Thank you