[PATCH 4/6] R600: Add zero undef variants of ctlz/cttz tests.
arsenm2 at gmail.com
Sat Jun 14 16:45:21 PDT 2014
On Jun 14, 2014, at 2:10 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
> On Fri, 2014-06-13 at 11:07 -0700, Matt Arsenault wrote:
>> On 06/13/2014 08:45 AM, Jan Vesely wrote:
>>> On Fri, 2014-06-13 at 11:24 -0400, Jan Vesely wrote:
>>>> On Thu, 2014-06-12 at 12:52 -0700, Matt Arsenault wrote:
>>>>> On Jun 12, 2014, at 12:41 PM, Jan Vesely <jan.vesely at rutgers.edu> wrote:
>>>>> Is it really correct to use bcnt for this? I was working on matching
>>>>> the undef versions a while ago and used FFBL / FFBH instructions,
>>>>> although I haven’t tried running these yet
>>>> You are right. I didn't check whether there's a better instruction for
>>>> I got ffbh/ffbl running on my TURKS card, but I'm unsure about
>>> The confusing part is the use of S_ instruction. With your patches I
>>> how is it different from:
>>> S_LOAD_DWORD s0
>>> V_FFBH_U32_e32 v0, s0
>>> BUFFER_STORE_DWORD v0, ...
>>> other than executing the computation on every work-item. is there
>>> power/performance difference?
>> Theoretically using the SALU instructions is faster and uses less power,
>> as well as saves VGPRs. In general it should be better to keep anything
>> on the SALU whenever possible, but I don't know the details of how SALU
>> instructions are executed or how helpful it is (other than helping with
>> register usage)
> aha, so the idea is that everything is first generated for SALU, and
> code that needs to run on every work-item is converted to VALU ops?
> is that why ctlz_zero_undef matches S_FLBIT, but not V_FFBH?
Yes, everything is preferably selected to the scalar instructions, and the SIFixSGPRCopies pass then replaces them
with VALU instructions where required to satisfy the operand constraints of their users.
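
To make the "select scalar first, fix up later" idea concrete, here is a toy model in Python. It is only a sketch and not the actual SIFixSGPRCopies implementation: the instruction tuples, the value names (tid, base, addr, val, uni) and the function move_to_valu are all hypothetical. It assumes one simple rule, namely that anything selected as scalar which consumes a vector value has to become a vector operation itself.

```python
# Toy model only: NOT the real SIFixSGPRCopies pass.
# Each instruction is (dest, unit, operands); unit is "SALU" or "VALU".
# All names below are hypothetical and chosen for illustration.

def move_to_valu(instructions):
    """Return {dest: unit} after propagating VALU requirements to users."""
    unit = {dest: u for dest, u, _ in instructions}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for dest, _, operands in instructions:
            uses_vector = any(unit.get(op) == "VALU" for op in operands)
            if unit[dest] == "SALU" and uses_vector:
                # A scalar op cannot read a per-work-item value.
                unit[dest] = "VALU"
                changed = True
    return unit

program = [
    ("tid", "VALU", []),                # per-work-item value (lives in a VGPR)
    ("base", "SALU", []),               # uniform value (e.g. a kernel argument)
    ("addr", "SALU", ["base", "tid"]),  # selected as scalar, but uses tid
    ("val", "SALU", ["addr"]),          # depends on addr, so it follows
    ("uni", "SALU", ["base"]),          # purely uniform, stays scalar
]

result = move_to_valu(program)
```

In this model only the values that depend on tid end up on the VALU; the purely uniform computation stays on the SALU.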
>> It would be useful to have tests that actually execute both variants in
>> piglit since it's highly likely I got these backwards (I swapped them at
>> one point). It's also confusing because the names are different between
>> the S and V versions.
> AFAICT the two patches look good in this regard, S_FLBIT starts from MSB
> and matches ctlz (and vice versa with S_FF1). Not sure if I can give
> full RB, since I don't have SI hw or complete understanding of the
> SALU/VALU transformation, but the patches look good to me.
> If you plan to push those patches I can rebase on top of them and add
> support for pre-SI GPUs and i64.
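
For checking the MSB/LSB direction on paper, a small reference model in Python may help. It is a sketch under assumptions: the helper names are made up, the operations are modelled on 32-bit values, and the find-first-bit helpers are assumed to return -1 for a zero input (which should be verified against the ISA documentation). The ctlz32/cttz32 helpers model the fully defined variants, which return 32 for zero.

```python
MASK32 = 0xFFFFFFFF

def find_first_bit_high(x):
    """Position of the first set bit counted from the MSB; -1 if x == 0 (assumed)."""
    x &= MASK32
    if x == 0:
        return -1
    return 32 - x.bit_length()

def find_first_bit_low(x):
    """Position of the first set bit counted from the LSB; -1 if x == 0 (assumed)."""
    x &= MASK32
    if x == 0:
        return -1
    return (x & -x).bit_length() - 1  # x & -x isolates the lowest set bit

def ctlz32(x):
    """Count leading zeros of a 32-bit value; 32 when x == 0."""
    x &= MASK32
    return 32 - x.bit_length()

def cttz32(x):
    """Count trailing zeros of a 32-bit value; 32 when x == 0."""
    x &= MASK32
    return 32 if x == 0 else (x & -x).bit_length() - 1
```

For every nonzero input the find-first-bit helpers agree with ctlz32/cttz32; they differ only at zero, which is exactly the input the zero-undef variants leave unspecified.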
>> As a side note, I have been using a global load from a pointer argument
>> as a way to force VALU instructions in tests, but this only works
>> because of a missing optimization. The pointer needs to be dynamically
>> indexed into by a VGPR, because loading a constant offset from a kernel
>> argument pointer could be optimized into an s_load.
> speaking about optimization, how does SALU op + V_MOV compare to
> equivalent VALU op?
I’m not really sure how the SALU executes. My current understanding is that one SALU and one VALU instruction can be executed every cycle, and that a VALU instruction takes 4, 8, or 16 cycles to complete for the entire wavefront. A v_mov takes 4 cycles, like any other full-rate instruction. The v_mov also isn’t always necessary: depending on the instruction encoding used, one SGPR can be used directly as a VALU instruction source.
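
The 4, 8, and 16 cycle figures can be reproduced with a back-of-the-envelope model in Python. This is a sketch and not a description of the real pipeline: it assumes a 64-wide wavefront processed by a 16-lane SIMD, and the names WAVEFRONT_SIZE, SIMD_WIDTH and valu_cycles are hypothetical.

```python
WAVEFRONT_SIZE = 64  # assumption: work-items per wavefront
SIMD_WIDTH = 16      # assumption: lanes processed per cycle

def valu_cycles(rate_divisor=1):
    """Cycles for one VALU instruction to cover the whole wavefront.

    rate_divisor is 1 for full-rate, 2 for half-rate and 4 for
    quarter-rate instructions.
    """
    return (WAVEFRONT_SIZE // SIMD_WIDTH) * rate_divisor

full_rate = valu_cycles(1)     # e.g. a v_mov
half_rate = valu_cycles(2)
quarter_rate = valu_cycles(4)
```

Under these assumptions the three values come out as 4, 8, and 16 cycles.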