[llvm-commits] [llvm] r85697 - in /llvm/trunk: lib/Target/ARM/ARMInstrNEON.td test/CodeGen/ARM/fmacs.ll test/CodeGen/ARM/fnmacs.ll test/CodeGen/Thumb2/cross-rc-coalescing-2.ll
David Conrad
lessen42 at gmail.com
Mon Nov 2 01:53:12 PST 2009
On Nov 1, 2009, at 1:31 PM, Evan Cheng wrote:
>
> On Nov 1, 2009, at 10:22 AM, Anton Korobeynikov wrote:
>
>> Hello, Evan
>>
>>> On the other hand, a vmla.32 followed by another vmla.32 is just
>>> fine. And
>>> it is faster than vmul + vadd. I agree we should try to solve it
>>> better.
>>> Perhaps expanding it before or during schedule2.
>> Right, NEON scheduling is tricky, it seems that our instruction
>> itineraries are not expressible enough for such complex pipelines.
>
> I think we should be able to handle at least the true dependency
> cases. Instruction latency is a function of both defining instruction
> and the use. cc'ing David for his comments.
Whoops, I'm too used to mailing lists with Reply-To set to the list
address.

Anyway, I misread the commit, but separating vmla.f32 into vmul.f32 +
vadd.f32 doesn't help either: the vmul+vadd chain has its result
available in the same cycle as worst-case vmla, 9 cycles after issue.
Ignoring pipelined instructions, the vmul+vadd pair will stall for 4
cycles between the two instructions and 4 cycles after, matching the
8-cycle stall for vmla in the same situation. The note in the reference
manual about vmla/vmls is just calling attention to the special
forwarding path available only to those instructions; the 8-cycle stall
is what would always happen otherwise, based on the cycle timings.
Thus even without modeling the special behaviour of vmla, it's always
better to use it: it will always be at least as fast as a separate
vmul+vadd. The same applies to the integer versions.