D52177: [InstCombine] Fold ~A - Min/Max(~A, O) -> Max/Min(A, ~O) - A

Thu Oct 11 15:20:58 PDT 2018

Oh, no. That's not what I wanted to hear. I presume we are looking at the same bit of code!

This was intended to fix things on the rgb kernel, and did pretty well on our tests. I think it was a 45% increase on some cpus, such as the m0plus. You can probably guess this was on Arm, in that case a thumb1 target. It sounds like we were hitting pretty much the opposite conditions, with the old code doing badly around the selects of nots in our case. I had presumed the smaller IR would have produced better code for everyone.

We do end up converting the whole thing to i32 because i8's are not legal types for our registers. We will do things with i8 for vectors on say aarch64 or v8a, but the cortex-m microcontrollers won't have that. I presume that Goldmont is in the same boat, not having vectorisation? (but, perhaps, supporting the instructions?)  The vectorised code looked pretty good (especially on aarch64; I tried skylake too, which was better, but had a lot of shuffling going on).

Um, in terms of fixes, I guess our alternatives are make this instcombine dependant on the target, or try to undo it somehow in the backend. My understanding is that the first option is not generally done in instcombine. Can someone give me some history there? Was it something that was decided as a rule, or just something that never came up. (I guess also in this case, not supporting this fold because it happens to cause the extra subs in a random test isn't a great reason to not do it).

The other options would be something like; if there multiple subs with the same (lhs?) operand, and that whole thing is invertible (which may be easy said than done), we invert it all back using ~a - ~b = b - a. I'm not sure how easy that is exactly. This case is a bit of a tangle of multiple uses.
Dave

From: Craig Topper via Phabricator <reviews at reviews.llvm.org>
Sent: 11 October 2018 20:13

craig.topper added a comment.

I'm seeing a regression on goldmont and silvermont cpus in an rgb cmyk conversion benchmark in 32-bit mode. What I've observed is that the 3 subtracts in the code all now have the same LHS register. X86 destroys the LHS of a subtract instruction so we have  to make copies before the subtracts. We're in 32-bit mode so our 8-bit register choices are %al, %bl, %cl, %dl, %ah, %bh, %ch, %dh. Silvermont and Goldmont have bad partial register handling for writing high and low 8-bit registers. Unlike Sandy Bridge, Haswell,  Skylake, the high and low registers aren't renamed independently. I tried playing around with promoting everything to 32-bits to avoid the partial registers, but that was actually worse somehow.

@dmgreen what target are you using?

Repository:
  rL LLVM

https://reviews.llvm.org/D52177

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.