[PATCH] D12032: Vector element extraction without stack operations on Power 8

hfinkel@anl.gov via llvm-commits llvm-commits at lists.llvm.org
Wed Aug 19 01:11:08 PDT 2015


hfinkel added inline comments.

================
Comment at: lib/Target/PowerPC/PPCInstrVSX.td:1284
@@ +1283,3 @@
+  dag LE_BYTE_1 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 56, 8), sub_32));
+  dag LE_BYTE_2 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 48, 16), sub_32));
+  dag LE_BYTE_3 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 40, 24), sub_32));
----------------
wschmidt wrote:
> wschmidt wrote:
> > nemanjai wrote:
> > > nemanjai wrote:
> > > > wschmidt wrote:
> > > > > wschmidt wrote:
> > > > > > This code concerns me, starting with LE_BYTE_2.  I see from your test cases that it happens to work, but it looks fragile to me.
> > > > > > 
> > > > > > If you have bytes 7 6 5 4 3 2 1 0 and apply RLDICL 48, 16, you will get
> > > > > > X X X X X X 3 2 where the Xs have been cleared to zero.  The correct answer is X X X X X X X 2.
> > > > > > 
> > > > > > For RLDICL 40, 24, you will get X X X X X 5 4 3, which is incorrect for both byte and halfword extraction.
> > > > > > 
> > > > > > The pattern you need is (RLDICL LE_DWORD_0 48, 8), followed by 40,8, etc.  Then you need separate patterns for LE_HWORD that uses 16 instead of 8.
> > > > > > 
> > > > > > Somehow in your tests you are ending up with extsb instructions that make this work, which I really don't understand based on the code you're specifying here (which just creates an i32, not an i8).  I am concerned that those are an artifact that could disappear.  Can you explain how those come to be generated?
> > > > > Ah, I see why.  Your tests are returning an i8, which forces the conversion from i32 to i8 that generates the extsb.  Without that, the incorrect code generation would be exposed.
> > > > I think this isn't clearly specified in the language reference, but as far as I can tell, the remainder of the wider register when a value is extracted from a vector is undefined and hence need not be cleared. Namely, when an i8 or i16 is extracted from a vector, it must be extended (sign or zero) since we have no such small registers. Also, I don't think this is necessarily restricted to vector extraction. However we happen to get an i8 or i16 into a register, the SDAG will have a sign_extend or a zero_extend.
> > > > Even code like this:
> > > > 
> > > > ```
> > > > signed char f(signed char c) {
> > > >   return c;
> > > > }
> > > > ```
> > > > 
> > > > Will end up with the following SDAG nodes:
> > > > 
> > > > ```
> > > > 0x1003ec39830: i8,ch = load 0x1003ec39708, 0x1003ec39390, 0x1003ec395e0<LD1[%c.addr]> [ORD=4]
> > > > 0x1003ec3a8c0: i32 = sign_extend 0x1003ec39830 [ORD=5]
> > > > ```
> > > > 
> > > > Similarly, with code like this:
> > > > 
> > > > ```
> > > > signed char f(vector signed char c) {
> > > >   return c[3];
> > > > }
> > > > ```
> > > > 
> > > > We end up with a very similar pair of SDAG nodes:
> > > > 
> > > > ```
> > > > 0x10009e69b58: i8 = extract_vector_elt 0x10009e697e0, 0x10009e69a30 [ORD=5]
> > > > 0x10009e69c80: i32 = sign_extend 0x10009e69b58 [ORD=6]
> > > > ```
> > > > And the sign_extend then becomes an extsb. Of course, this does not apply with fast-isel, but then again, neither does any of this patch since it is all based on the SDAG.
> > > > I used signed char here just as an illustration since you mentioned the extsb instruction, but the equivalent zero_extend applies with the unsigned variants (in both cases).
> > > > 
> > > > So to summarize, I can add the clear instruction to each case, but I'm quite certain that it will be a reduntant instruction. It will also make the table gen logic more complex because I'd need a clear for the unsigned cases and a sign-extend for the signed cases.
> > > This code:
> > > 
> > > ```
> > > int f(vector signed char c, int i) {
> > >   return c[3];
> > > }
> > > ```
> > > Has the same extsb, which is then followed in this case by an extsw which is redundant, but understandable given the integral promotion.
> > You are relying on context for your patterns to argue that they will always be correct.  That is ok, but it is then incumbent on you to document this extremely well.  "Clever tricks" should never be unaccompanied by diligent commentary.  The fact is that if somebody were to use your "LE_BYTE_x" patterns in the (perfectly reasonable) belief that this would actually produce a right-adjusted zero- or sign-extended byte, they would be grossly disappointed.  Outside of their contextual uses in the vector_extract patterns, these would not produce what their name argues they would produce.  They would produce an i32 that is pretending to be an i8 but contains values not representable by an i8.
> > 
> > So go ahead and do this, but document the bejeebus out of it.  Then if something goes wrong with the contextual assumptions in the future, the poor maintainer will have half a chance of sorting it out.  When creating a building block, you have to be aware that others may want to use it in a way you didn't foresee.
> > 
> > I understand why you want to do this, but a more proper solution is to do it right and make sure that we have peephole optimizations later on that will remove redundant extend operations.  We need those anyway.  Therefore, go ahead and do what you're doing, but in addition to documenting it, add commentary indicating what ought to be done in the future.
> > 
> > Hal, please feel to weigh in...
> Actually, I'm going to stick to my guns on this one.
> 
> I'm not asking you to add another extend instruction.  I'm asking you to use the correct form of RLDICL for the LE_BYTE_x and LE_HWORD_x cases.  If you change the last parameter to RLDICL to always be 8, you will isolate a byte.  Do this for the LE_BYTE_x forms.  If you change it to always be 16, you will isolate a halfword.  Do this for the LE_HWORD_X forms.
> 
> This does not cost any extra instructions, and the patterns will generate i32s that contain true i8s and i16s, respectively.  Furthermore, this has an added benefit:  We can later optimize away the sign- or zero-extend instruction that gets generated by the context.  A peephole optimization can easily prove that a rldicl that has already isolated a byte does not need a subsequent extsb or second truncating rldicl.  The same is true for isolation of a halfword.
> 
> On the other hand, if you leave things as you have them, we can never get rid of the extsb, because the optimizer doesn't know that you left nonzero bits in there because you "knew" there would be an extend coming along to fix things up.
First, Nemanja appears to be right, it is legal to leave the higher-order bits arbitrarily defined. The code in DAGTypeLegalizer::PromoteIntOp_EXTRACT_VECTOR_ELT ends with:

  // EXTRACT_VECTOR_ELT can return types which are wider than the incoming
  // element types. If this is the case then we need to expand the outgoing
  // value and not truncate it.
  return DAG.getAnyExtOrTrunc(Ext, dl, N->getValueType(0));

Second, Bill is right. We should zero-out the higher-order bits if we can do so without any extra instructions. We already have a peephole optimization in PPCISelDAGToDAG.cpp to eliminate unnecessary i32->i64 extensions (see the PeepholePPC64ZExt() function), and it should be easy to extend it to catch this case as well. Also, it makes "manual" debugging easier to have the values zero extended (and making debugging easier at no additional cost is certainly a good thing).

And, for any pattern where you do leave some of the higher-order bits arbitrarily defined, please document that explicitly.



Repository:
  rL LLVM

http://reviews.llvm.org/D12032





More information about the llvm-commits mailing list