[PATCH][X86] Fix for cannot select broadcast

Nadav Rotem nrotem at apple.com
Mon Mar 24 10:00:47 PDT 2014


On Mar 24, 2014, at 9:51 AM, Quentin Colombet <qcolombet at apple.com> wrote:

> Thanks Nadav.
> 
> On Mar 23, 2014, at 2:40 PM, Nadav Rotem <nrotem at apple.com> wrote:
> 
>> 
>> On Mar 21, 2014, at 3:55 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>> 
>>> Hi Nadav,
>>> 
>>> Here is a new patch that provides the fallback patterns.
>>> 
>>> That said, the new patterns creates less efficient code that my initial proposal.
>>> 
>>> I.e., fallback patterns:
>>> 	movb	(%rdi), %al
>>> 	vmovd	%eax, %xmm1
>>> 	vpbroadcastb	%xmm1, %xmm1
>>> Orignial patch:
>>> 	vpbroadcastb	(%rdi), %xmm1
>>> 
>>> You told me that a later pass is supposed to catch the new opportunities of load folding. Looks like it does not know how to do that across copies.
>>> We would have to fix that at some point.
>>> 
>> 
>> foldMemoryOperandImpl in X86InstrInfo.cpp folds memory operands into instructions. It won’t work for integers because unlike floats you first need to move the scalar value into the vector register.
> Agreed, that’s why I said it does not know how to do that across copies :).
> 
>> 
>>> Anyhow, I think both patches are orthogonal, having fallback patterns looks like a good idea but selection directly the instructions that match what X86ISelLowering thought it was lowering seems a good approach too :).
>>> 
>>> What do you think?
>> 
>> I think that vbroadcast instructions are special because of they way that they are used. In many cases the loaded value could be hoisted outside the loop but selecting the vbroadcast MI prevents the hoisting of the load.
> I see. Though the broadcast could be hoisted outside of the loop too, isn’t it?

Oh, right, my mistake. Yes, it can also be hoisted out. 

> 
>> One example is constants that are materialized to constant pool loads. Ideally we would like to save the scalar constant in memory and load it outside of the loop. However, by selecting the vbroadcast we prevent the hoisting of the load. Currently we save the pre-broadcasted constant in memory, which is wasteful. Loading from memory and splatting a value across a vector register is not an expensive operation and vbroadcast is not a huge win for the typical case. vbroadcast *is* beneficial for tiled matrix multiplications because you need to multiply a scalar by a vector, but matmul are usually hand-optimized. The typical case where vbroadcast is selected is where a constant is multiplied by a vector. Maybe you could benchmark the llvm test suite and see if it really matters. Last time I checked I found that in-register broadcasts, what were introduced in AVX2 are useful, but the AVX1 broadcasts from memory were not. 
> If the patch with the fallback patterns looks good to you, I’d like to commit it to fix the instruction selection crash.

Yes, please. LGTM. 

> 
> In parallel, I will benchmark the llvm test suite to see if the other patch would help performances.
> 
>> 
>>> 
>>> As a side question, do you happen to have a test case for the existing fallback patterns?
>>> The original commit  r155437 do not have any test case and I was wondering how a load could have new users added to it (some constant load?). Thus, I wanted to check we were not already working around the problem I mentioned.
>> 
>> I don’t remember exactly but I can think of a case that can cause this problem.  Let’s say that in two places in the program we use the vector value [1.0,1.0,1.0,1.0].  When we lower the first use we turn it into a broadcast of the loaded scalar 1.0. When we lower the second value we do the same thing, except that this time we have to uses to the “1.0” scalar. Also, scalar loads to constant values are introduced in many places in ISel lowering. 
> Makes sense.
> 
> Thanks,
> -Quentin
>> 
>>> 
>>> Thanks,
>>> -Quentin
>>> <broadcast_fallback.patch>
>>> 
>>> On Mar 21, 2014, at 1:25 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>> 
>>>> Talked with Nadav off-line.
>>>> 
>>>> In fact X86 backend has fallback patterns for such cases. We may just miss a few of them. Looking into it.
>>>> 
>>>> Thanks,
>>>> -Quentin
>>>> 
>>>> On Mar 21, 2014, at 12:49 PM, Quentin Colombet <qcolombet at apple.com> wrote:
>>>> 
>>>>> Hi Nadav and Owen,
>>>>> 
>>>>> Here is a patch that fixes a cannot select failure for broadcast instruction in the X86 backend <rdar://problem/16074331>.
>>>>> Although the fix is in the x86 backend (Nadav’s domain), the root cause of the problem is in SelectionDAGISel (Owen’s domain).
>>>>> 
>>>>> Thanks for your review and feedbacks.
>>>>> 
>>>>> 
>>>>> ** Symptom **
>>>>> 
>>>>> Compiling the test case in the patch with x86_64 and avx2 would produce:
>>>>> --
>>>>> LLVM ERROR: Cannot select: 0x7f9d02059128: v8i16 = X86ISD::VBROADCAST 0x7f9d02037178 [ORD=9] [ID=13]
>>>>>   0x7f9d02037178: i16,ch = load 0x7f9d0205ab18, 0x7f9d02059230, 0x7f9d02036b48<LD2[%cV_R.addr](align=4)> [ORD=7] [ID=11]
>>>>>     0x7f9d02059230: i64,ch = CopyFromReg 0x7f9d01512a30, 0x7f9d02036a40 [ORD=1] [ID=8]
>>>>>       0x7f9d02036a40: i64 = Register %vreg0 [ID=1]
>>>>>     0x7f9d02036b48: i64 = undef [ID=3]
>>>>> In function: isel_crash_broadcast
>>>>> FileCheck error: '-' is empty.
>>>>> 
>>>>> --
>>>>> 
>>>>> 
>>>>> ** Problem **
>>>>> 
>>>>> The instruction supposed to match is a broadcast of a value coming from the memory, i.e., a load folded into a broadcast.
>>>>> The load is used only once and the broadcast does not have any other input dependencies. Thus, the folding is possible modulo a proper scheduling of the broadcast node.
>>>>> 
>>>>> The problem here is that during the select phase of the  isel process, SelectionDAGISel performs a check to see if the load is foldable into the broadcast (HandleMergeInputChains). This test is overly conservative and fails to detect that the folding is okay.
>>>>> 
>>>>> 
>>>>> ** Cause **
>>>>> 
>>>>> HandleMergeInputChains is not actually checking that a cycle will be created if something is folded into something else.
>>>>> In particular, it is not checking for reachability but instead relies on heuristics to give a quick answer. This answer is conservatively correct:
>>>>> - No: no cycle will be created.
>>>>> - Yes: a cycle *may* be created. 
>>>>> 
>>>>> In this example we have something like this:
>>>>> a = ld @a
>>>>> |   \
>>>>> |    st b, @b
>>>>> C = vbroadcast a
>>>>> 
>>>>> Here, we are trying to fold a into C and there is a chain between the load of a and the store of b.
>>>>> Since this chain is not part of the pattern it assumes it will create a cycle:
>>>>>  st b, @b
>>>>>    ^
>>>>>     |
>>>>>     v  
>>>>> C = vbroadcast ld @b
>>>>> 
>>>>> This is wrong, because the chain is in one direction C -> st b, i.e., we can schedule C before st b.
>>>>> I beleive this limitation is intended for three reasons:
>>>>> 1. to avoid costly reachability checks.
>>>>> 2. to handle only the rewriting of token factor from the not-yet-matched nodes to the matched node.
>>>>> 3. everything that has been matched is ready to schedule.
>>>>> 
>>>>> Interestingly, inverting the selection order of the st and vbroadcast nodes in this dag, solves the issue, because now the store is considered as scheduled and therefore cannot create a cycle (condition #3).
>>>>> 
>>>>> 
>>>>> ** Proposed Solution **
>>>>> 
>>>>> Fixing the root cause of the problem requires to change a lot of assumption in SelectionDAGISel, in particular #3. Moreover, #1 may be very harmful for the compile time.
>>>>> Therefore, the proposed fix works around the problem by forcing a valid schedule for the broadcast instruction to please the select phase. In parallel, I will file a PR for the general issue.
>>>>> 
>>>>> Other ideas are welcome!
>>>>> 
>>>>> For the details, the patch is transforming this:
>>>>> load -- chain --> someNode  -- more deps ->
>>>>>       \
>>>>>         +-- use --> broadcast -- more deps ->
>>>>> 
>>>>> Into this:
>>>>> load                          +-- chain  --> someNode  -- more deps ->
>>>>>       \                       /
>>>>>         +-- use --> broadcast -- more deps ->
>>>>> 
>>>>> Thus, the broadcast will be schedule in place of the load. By construction, this is valid, because load has only one use and broadcast only one input dependency (the load).
>>>>> 
>>>>> Thanks,
>>>>> -Quentin
>>>>> <broadcast.patch>
>>>>> _______________________________________________
>>>>> llvm-commits mailing list
>>>>> llvm-commits at cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>> 
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20140324/0731a2dd/attachment.html>


More information about the llvm-commits mailing list