[PATCH 4/6] R600: Add zero undef variants of ctlz/cttz tests.

Sat Jun 14 16:45:21 PDT 2014

On Jun 14, 2014, at 2:10 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:

> On Fri, 2014-06-13 at 11:07 -0700, Matt Arsenault wrote:
>> On 06/13/2014 08:45 AM, Jan Vesely wrote:
>>> On Fri, 2014-06-13 at 11:24 -0400, Jan Vesely wrote:
>>>> On Thu, 2014-06-12 at 12:52 -0700, Matt Arsenault wrote:
>>>>> On Jun 12, 2014, at 12:41 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>>>> 
> 
> SNIP
> 
>>>>> 
>>>>> Is it really correct to use bcnt for this? I was working on matching
>>>>> the undef versions a while ago and used FFBL / FFBH instructions,
>>>>> although I haven’t tried running these yet
>>>> You are right. I didn't check whether there's a better instruction for
>>>> these.
>>>> 
>>>> I got ffbh/ffbl running on my TURKS card, but I'm unsure about
>>>> SI.
>>> The confusing part is the use of S_ instruction. With your patches I
>>> see:
>>>  S_LOAD_DWORD
>>>  S_FLBIT_I32_B32
>>>  V_MOV_B32_e32
>>>  BUFFER_STORE_DWORD
>>> 
>>> how is it different from:
>>>  S_LOAD_DWORD s0
>>>  V_FFBH_U32_e32 v0, s0
>>>  BUFFER_STORE_DWORD v0, ...
>>> 
>>> other than executing the computation on every work-item? Is there a
>>> power/performance difference?
>> 
>> Theoretically, using the SALU instructions is faster, uses less power,
>> and saves VGPRs. In general it should be better to keep anything
>> on the SALU whenever possible, but I don't know the details of how SALU
>> instructions are executed or how helpful it is (other than helping with
>> register usage).
> 
> aha, so the idea is that everything is first generated for SALU, and
> code that needs to run on every work-item is converted to VALU ops?
> is that why ctlz_zero_undef matches S_FLBIT, but not V_FFBH?
Yes. Instruction selection prefers the scalar instructions, and the SIFixSGPRCopies pass then replaces them
with VALU instructions where required to satisfy the operand constraints of their users.
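
For reference, the bit-scan semantics under discussion can be modeled in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the -1 result for a zero input is an assumption about the hardware scans that is not stated in this thread. It shows why the zero-undef variants are the easy ones to match: a scan from the MSB agrees with ctlz, and a scan from the LSB agrees with cttz, for every nonzero input.

```python
def scan_from_msb(x, bits=32):
    """Position of the first set bit counted from the MSB (hypothetical model).

    Returning -1 for x == 0 is an assumption, not taken from the thread.
    """
    x &= (1 << bits) - 1
    if x == 0:
        return -1
    return bits - x.bit_length()


def scan_from_lsb(x, bits=32):
    """Position of the first set bit counted from the LSB (hypothetical model).

    Returning -1 for x == 0 is an assumption, not taken from the thread.
    """
    x &= (1 << bits) - 1
    if x == 0:
        return -1
    # x & -x isolates the lowest set bit.
    return (x & -x).bit_length() - 1


def ctlz(x, bits=32):
    """Count leading zeros, defined as `bits` for a zero input."""
    x &= (1 << bits) - 1
    return bits - x.bit_length()


def cttz(x, bits=32):
    """Count trailing zeros, defined as `bits` for a zero input."""
    x &= (1 << bits) - 1
    if x == 0:
        return bits
    return (x & -x).bit_length() - 1


def agree_on_nonzero(samples):
    """True if the scans match ctlz/cttz for every nonzero sample."""
    return all(
        scan_from_msb(v) == ctlz(v) and scan_from_lsb(v) == cttz(v)
        for v in samples
        if v != 0
    )
```

Only the zero input separates the two families in this model, which is exactly the case the zero-undef intrinsics leave unspecified.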

> 
>> 
>> It would be useful to have tests that actually execute both variants in 
>> piglit since it's highly likely I got these backwards (I swapped them at 
>> one point). It's also confusing because the names are different between 
>> the S and V versions.
> 
> AFAICT the two patches look good in this regard, S_FLBIT starts from MSB
> and matches ctlz (and vice versa with S_FF1). Not sure if I can give
> full RB, since I don't have SI hw or complete understanding of the
> SALU/VALU transformation, but the patches look good to me.
> If you plan to push those patches I can rebase on top of them and add
> support for pre-SI GPUs and i64.
> 
>> 
>> As a side note, I have been using a global load from a pointer argument
>> as a way to enforce using VALU instructions in tests, but this only
>> works because of a missing optimization. The pointer needs to be dynamically
>> indexed into by a VGPR, because loading a constant offset from a kernel
>> argument pointer could be optimized into an s_load.
> 
> speaking about optimization, how does SALU op + V_MOV compare to
> equivalent VALU op?

I’m not really sure how the SALU executes. My current understanding is that one SALU and one VALU instruction can be executed every cycle, and that a VALU instruction takes 4, 8, or 16 cycles to complete for the entire wavefront. A v_mov takes 4 cycles, like any other full-rate instruction. The v_mov also isn’t always necessary: depending on the instruction encoding used, one SGPR can be used directly as a VALU instruction source.
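
As a rough back-of-the-envelope check on those cycle counts, here is a small Python sketch. It assumes a 64-wide wavefront processed 16 lanes at a time, and treats slower instructions as a simple rate divisor; neither number is stated above, and the names are only illustrative.

```python
WAVEFRONT_SIZE = 64  # assumed work-items per wavefront
SIMD_LANES = 16      # assumed lanes processed per cycle


def valu_cycles(rate_divisor=1):
    """Cycles for one VALU instruction to cover a whole wavefront.

    rate_divisor is 1 for full rate, 2 for half rate, 4 for quarter rate
    (a simplifying assumption about how slower instructions behave).
    """
    passes = WAVEFRONT_SIZE // SIMD_LANES
    return passes * rate_divisor


# Full-, half-, and quarter-rate instructions under these assumptions.
cycle_counts = [valu_cycles(d) for d in (1, 2, 4)]
```

Under these assumptions the three rates give 4, 8, and 16 cycles, so a scalar op followed by a v_mov costs about one full-rate VALU slot plus the scalar op.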

-Matt