[PATCH] D10964: [Codegen] Add intrinsics 'hadd*' and corresponding SDNodes for horizontal sum operation.
hfinkel at anl.gov
hfinkel at anl.gov
Sat Jul 25 15:29:47 PDT 2015
hfinkel added a comment.
In http://reviews.llvm.org/D10964#210687, @bruno wrote:
> > IMO, the scalar version is more natural w.r.t the HADD operation itself and also more canonical. Also, as you mentioned vector version will need an extractelement which may have some performance impact also.
> > In fact on X86, we need to do a DAGCombine of *ABSDIFF* and *HADD* to generate PSAD instruction which is our main objective for adding the two intrinsics. I feel vector version of HADD will complicate this DAGCombine.
> I don't see how making it return a vector type would affect performance. You will need a DAG combine anyway here and you don't need to deal with the extractelement at all, besides on the case where you really want to bring it back to a scalar result, this should be up to the front-end to generate. Also, we have precedence in using inserts/extracts with other vector nodes as the canonical way to represent lower level vector instructions.
> My point is that if we make the result available in the vector we have two advantages: (1) if there are additional vector operations on the result, we don't need to re-insert it into another vector to continue vector work and (2) if you want the result on a scalar, you might just use extractelement.
> In x86, PSAD returns a vector and if you want to keep the result on a scalar you need to generate the appropriate movs to do the work, I believe the same applies to other archs with similar instructions. IMO, it doesn't seem natural to always return the value on a scalar and then having to insert it back to a vector to proceed with vector work.
I understand your point, but I think that the intrinsic should return a scalar for conceptual clarity. It is, fundamentally, computing a scalar quantity. I understand, however, that doing this will require more work on our part to produce code of equivalent quality.
Specifically, we'll need code in CodeGenPrep to push and replicate insertelement(undef, hadd(x), 0) instructions 'up' (closer to the hadd(x)) so that CodeGen will always see the pair together.
Then for all backends such that the underlying hadd returns its result in a vector register, will need to pattern match the insertelement away to a noop instead of actually moving the result into the scalar register file. Not all backends will have hadds that work like this, but I believe X86 and AArch64 will, for example.
However, I believe this bit of extra work is worthwhile. The fact that some common ISAs have an horizontal add that happens to return the result of the add in some lane of an output vector is not something that we should expose at the IR level.
More information about the llvm-commits