[PATCH] D12032: Vector element extraction without stack operations on Power 8

Tue Aug 18 18:08:29 PDT 2015

wschmidt added inline comments.

================
Comment at: lib/Target/PowerPC/PPCInstrVSX.td:1284
@@ +1283,3 @@
+  dag LE_BYTE_1 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 56, 8), sub_32));
+  dag LE_BYTE_2 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 48, 16), sub_32));
+  dag LE_BYTE_3 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 40, 24), sub_32));
----------------
nemanjai wrote:
> nemanjai wrote:
> > wschmidt wrote:
> > > wschmidt wrote:
> > > > This code concerns me, starting with LE_BYTE_2.  I see from your test cases that it happens to work, but it looks fragile to me.
> > > > 
> > > > If you have bytes 7 6 5 4 3 2 1 0 and apply RLDICL 48, 16, you will get
> > > > X X X X X X 3 2 where the Xs have been cleared to zero.  The correct answer is X X X X X X X 2.
> > > > 
> > > > For RLDICL 40, 24, you will get X X X X X 5 4 3, which is incorrect for both byte and halfword extraction.
> > > > 
> > > > The pattern you need is (RLDICL LE_DWORD_0 48, 8), followed by 40,8, etc.  Then you need separate patterns for LE_HWORD that uses 16 instead of 8.
> > > > 
> > > > Somehow in your tests you are ending up with extsb instructions that make this work, which I really don't understand based on the code you're specifying here (which just creates an i32, not an i8).  I am concerned that those are an artifact that could disappear.  Can you explain how those come to be generated?
> > > Ah, I see why.  Your tests are returning an i8, which forces the conversion from i32 to i8 that generates the extsb.  Without that, the incorrect code generation would be exposed.
> > I think this isn't clearly specified in the language reference, but as far as I can tell, the remainder of the wider register when a value is extracted from a vector is undefined and hence need not be cleared. Namely, when an i8 or i16 is extracted from a vector, it must be extended (sign or zero) since we have no such small registers. Also, I don't think this is necessarily restricted to vector extraction. However we happen to get an i8 or i16 into a register, the SDAG will have a sign_extend or a zero_extend.
> > Even code like this:
> > 
> > ```
> > signed char f(signed char c) {
> >   return c;
> > }
> > ```
> > 
> > Will end up with the following SDAG nodes:
> > 
> > ```
> > 0x1003ec39830: i8,ch = load 0x1003ec39708, 0x1003ec39390, 0x1003ec395e0<LD1[%c.addr]> [ORD=4]
> > 0x1003ec3a8c0: i32 = sign_extend 0x1003ec39830 [ORD=5]
> > ```
> > 
> > Similarly, with code like this:
> > 
> > ```
> > signed char f(vector signed char c) {
> >   return c[3];
> > }
> > ```
> > 
> > We end up with a very similar pair of SDAG nodes:
> > 
> > ```
> > 0x10009e69b58: i8 = extract_vector_elt 0x10009e697e0, 0x10009e69a30 [ORD=5]
> > 0x10009e69c80: i32 = sign_extend 0x10009e69b58 [ORD=6]
> > ```
> > And the sign_extend then becomes an extsb. Of course, this does not apply with fast-isel, but then again, neither does any of this patch since it is all based on the SDAG.
> > I used signed char here just as an illustration since you mentioned the extsb instruction, but the equivalent zero_extend applies with the unsigned variants (in both cases).
> > 
> > So to summarize, I can add the clear instruction to each case, but I'm quite certain that it will be a reduntant instruction. It will also make the table gen logic more complex because I'd need a clear for the unsigned cases and a sign-extend for the signed cases.
> This code:
> 
> ```
> int f(vector signed char c, int i) {
>   return c[3];
> }
> ```
> Has the same extsb, which is then followed in this case by an extsw which is redundant, but understandable given the integral promotion.
You are relying on context for your patterns to argue that they will always be correct.  That is ok, but it is then incumbent on you to document this extremely well.  "Clever tricks" should never be unaccompanied by diligent commentary.  The fact is that if somebody were to use your "LE_BYTE_x" patterns in the (perfectly reasonable) belief that this would actually produce a right-adjusted zero- or sign-extended byte, they would be grossly disappointed.  Outside of their contextual uses in the vector_extract patterns, these would not produce what their name argues they would produce.  They would produce an i32 that is pretending to be an i8 but contains values not representable by an i8.

So go ahead and do this, but document the bejeebus out of it.  Then if something goes wrong with the contextual assumptions in the future, the poor maintainer will have half a chance of sorting it out.  When creating a building block, you have to be aware that others may want to use it in a way you didn't foresee.

I understand why you want to do this, but a more proper solution is to do it right and make sure that we have peephole optimizations later on that will remove redundant extend operations.  We need those anyway.  Therefore, go ahead and do what you're doing, but in addition to documenting it, add commentary indicating what ought to be done in the future.

Hal, please feel to weigh in...

================
Comment at: lib/Target/PowerPC/PPCInstrVSX.td:1309
@@ +1308,3 @@
+  // LE variable byte
+  dag LE_VBYTE_PERM_VEC = (LVSL ZERO8, (ANDC8 (LI8 8), $Idx));
+  dag LE_VBYTE_PERMUTE = (VPERM $S, $S, LE_VBYTE_PERM_VEC);
----------------
nemanjai wrote:
> wschmidt wrote:
> > This looks wrong.  If $Idx = 19, (ANDC8 (LI8 8), $Idx) will AND 0...001000 with 1...101100, giving 0...001000 = 8.  Basically this expression will always produce either 0 or 8, right?  That is surely not what you want.  This same issue seems repeated several places below, so I've stopped trying to figure these expressions out for now.
> > 
> > Have you done any execution testing to check that the code you're generating works?  I'm skeptical that it can.
> I may be missing something here, but always getting a 0 or 8 is precisely what I was after. Basically, if the element is on the left half of the VSR, I don't need to shift it. If it is on the right side of the VSR, I do need to shift it and I need to shift it by EXACTLY 8 bytes. This applies on both LE and BE except that the element numbers that are on the left vs. right half of the VSR are exact opposites.
> For LE, the bytes 0-7 need to be shifted before the MFVSR (so the expression you mentioned returns 8).
> 
> Subsequent to this shift, what is then on the left side of the VSR gets moved out to a GPR.
> Now that I have the right half of the VSR in the GPR, I need to shift to the right by the correct number of bytes (0-7). So element zero, needs not shift at this point, whereas elements 1-7 need to be shifted by the equal number of bytes. Similarly, element 8 needs not shift, whereas elements 9-15 need to be shifted by $Idx & 7 bytes. Of course, the shift amount I need to specify to SRD is in bits, hence the reason for the left shift of 3.
> 
> Here's the complete shift sequence for LE (bytes):
> 
> ```
> Elem    LShift in VSR    RShift in GPR (bytes)
> 0       8                0
> 1       8                1
> 2       8                2
> 3       8                3
> 4       8                4
> 5       8                5
> 6       8                6
> 7       8                7
> 8       0                0
> 9       0                1
> 10      0                2
> 11      0                3
> 12      0                4
> 13      0                5
> 14      0                6
> 15      0                7
> ```
> I have worked out similar sequences for the halfwords and for both on BE.
> Also, I have an execution test case that prints each element of each size vector both as constant element values and as variables. In my testing, the test case was compiled with clang and with gcc and the results compared with diff. I have run this on P8 systems, both LE and BE.
OK.  Again, this needs to be better documented.  Your commentary for the LShift portion is:
"What we do is set up the index by masking off bits we don't need and shifting accordingly."  You're not actually masking off any bits for this case, which is what confused me.  Based on that commentary, this didn't appear to fulfill your intent.  I note that your subsequent explanation to me doesn't have any mention of masking off bits either.

So please document more clearly what is going on with these patterns.  Too much commentary is never a crime...an expression like

  dag BE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60),
                                       sub_32);

without its own line of commentary is kind of a travesty.

If you don't mind, I would like to wait for a more heavily documented version before fully reviewing the variable extracts.

Repository:
  rL LLVM

http://reviews.llvm.org/D12032