[llvm-dev] combineRepeatedFPDivisors design questions

Mon Mar 6 04:11:21 PST 2017

Hello,

Recently I have stumbled across an almost double performance decrease in one of my codebases when compiling with clang instead of gcc. I was testing with the currently latest 3.9.1 as well as 4.0.0 rc2 targeting x86-64 darwin with -Ofast, and after some investigation was able to narrow down the code to the following snippet:

__attribute__ ((noinline)) bool calculate(double c) {
	uint32_t b = 0;
	uint32_t a = 0;
	while (b < 0xFFFFFFFF) {
		if ((uint64_t)(b / c) % 2 == 0)
			a++;
		b++;
	}
	return a;
}

The culprit here is the division, which is not hoisted out of the loop and replaced with a multiplication by a reciprocal. From what I understand the current implementation of combineRepeatedFPDivisors responsible for floating point division conversion does that given that the following preconditions are met:
— unsafe math optimisations are enabled;
— the amount of the divisions by the same divisor in a single basic block is above the threshold (2 for x86-64).

According to the IR the divisions are in two separate blocks, and I guess that it must be the reason for the optimisation not taking place. This seems to be proven by using an even loop count like 0xFFFFFFF0 to avoid a middle check. In this case the optimisation works just fine. Here are the IR samples: https://ghostbin.com/paste/r4td2 <https://ghostbin.com/paste/r4td2>.

At this point the thoughts arising on my side are:
— Should this be considered a bug and reported to llvm-bugs or a desired behaviour?
— If this is considered a bug, how is one supposed to fix it? With the exact sample provided the compiler could generally predict the amount of iterations being over the threshold and optimise the output, however, the real world examples (including mine) will unlikely be that simple. Loop body would be much larger, the conditions would be more complicated, the divisions would likely occur in several different basic blocks for various reasons, and I think in general it is not possible to predict the iteration count.
— From what I can tell at least current gcc (6.3.0) and icc (17.0.2) are capable of doing that, with both this snippet and even when the loop count depends on an another argument, which means they are not capable to determine whether there would only be 0 iterations or many more, however, the optimisation would still be applied (https://ghostbin.com/paste/ttnrz <https://ghostbin.com/paste/ttnrz>). Perhaps llvm could also be more permissive when it sees a loop?

Best regards,
Vit

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-dev/attachments/20170306/a51e8fe2/attachment.html>