[llvm-commits] [llvm] r85697 - in /llvm/trunk: lib/Target/ARM/ARMInstrNEON.td test/CodeGen/ARM/fmacs.ll test/CodeGen/ARM/fnmacs.ll test/CodeGen/Thumb2/cross-rc-coalescing-2.ll
David Conrad
lessen42 at gmail.com
Mon Nov 2 01:53:12 PST 2009
On Nov 1, 2009, at 1:31 PM, Evan Cheng wrote:
>
> On Nov 1, 2009, at 10:22 AM, Anton Korobeynikov wrote:
>
>> Hello, Evan
>>
>>> On the other hand, a vmla.32 followed by another vmla.32 is just
>>> fine. And
>>> it is faster than vmul + vadd. I agree we should try to solve it
>>> better.
>>> Perhaps expanding it before or during schedule2.
>> Right, NEON scheduling is tricky, it seems that our instruction
>> itineraries are not expressible enough for such complex pipelines.
>
> I think we should be able to handle at least the true dependency
> cases. Instruction latency is a function of both defining instruction
> and the use. cc'ing David for his comments.
Whoops, I'm too used to mailing lists with Reply-To set to the list
address.

Anyway, I misread the commit, but separating vmla.f32 into vmul.f32 +
vadd.f32 doesn't help either: the vmul+vadd chain has its result
available in the same cycle as worst-case vmla, 9 cycles after issue.
Ignoring pipelined instructions, the vmul+vadd pair will stall for 4
cycles between the two instructions and 4 cycles after, matching the
8-cycle stall for vmla in the same situation. The note in the reference
manual about vmla/vmls is just calling attention to the special
forwarding path available only to those instructions; the 8-cycle stall
is what would always happen otherwise, based on the cycle timings.
Thus even without modeling the special behaviour of vmla, it's always
better to use it: it will always be at least as fast as a separate
vmul+vadd. The same applies to the integer versions.