[llvm-dev] What should a truncating store do?

Mon Sep 25 12:13:45 PDT 2017

So what is the correct behavior of the store <2 x i2>. Storing two bytes with a zext:ed i2 in each, or a bit packed vector?

I can't remember any documentation mentioning that vectors are bit packed. But if LLVM is supposed to bit pack vectors, should we do it for any element size, or only when element size is less than the byte size, or only for i1 vectors?

Maybe bit packing should be optional (target specific)?

Btw, the IR/ISD note for a <2 x i2> store indicates that it is an one byte (ST1) store, due to the getStoreSize() methods saying one byte for this kind of vector. So alias analysis etc thinks that only one byte is written.
________________________________________
From: Friedman, Eli <efriedma at codeaurora.org>
Sent: Monday, September 25, 2017 8:08:01 PM
To: Björn Pettersson A; Jon Chesterfield
Cc: llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] What should a truncating store do?

On 9/25/2017 9:14 AM, Björn Pettersson A wrote:
(Not sure if this exactly maps to “truncating store”, but I think it at least touches some of the subjects discussed in this thread)

Our out-of-tree-target need several patches to get things working correctly for us.
We have introduced i24 and i40 types in ValueTypes/MachineValueTypes (in addition to the normal pow-of-2 types). And we have vectors of those (v2i40, v4i40).
And the byte size in our target is 16 bits.

When storing an i40 we need to store it as three 16-bit bytes, i.e. 48 bits.
When storing a v4i40 vector it will be stored as 4x48 bits.

One thing that we have had to patch is the getStoreSize() method in ValueTypes/MachineValueTypes where we assume that vectors are bitpacked when the element size is smaller than the byte size (“BitsPerByte”):

     /// Return the number of bytes overwritten by a store of the specified value
     /// type.
     unsigned getStoreSize() const {
-      return (getSizeInBits() + 7) / 8;
+      // We assume that vectors with elements smaller than the byte size are
+      // bitpacked. And that elements larger than the byte size should be padded
+      // (e.g. i40 type for Phoenix is stored using 3 bytes (48 bits)).
+      bool PadElementsToByteSize =
+        isVector() && getScalarSizeInBits() >= BitsPerByte;
+      if (PadElementsToByteSize)
+        return getVectorNumElements() * getScalarType().getStoreSize();
+      return (getSizeInBits() + (BitsPerByte-1)) / BitsPerByte;
     }

The patch seems to work for in-tree-target tests as well as our out-of-tree target.
If it is a correct assumption for all targets is beyond my knowledge. Maybe only i1 vectors should be bitpacked?

Anyway, I think the bitpacked cases is very special (we do not use it for our target…).
AFAIK bitcast is defined as writing to memory followed by a load using a different type. And I think that doing several scalar operations should give the same result as when using vectors. So bitcast of bitpacked vectors should probably be avoided?

Yes, store+load is the right definition of bitcast.  And in fact, the backend will lower a bitcast to a store+load to a stack temporary in cases where there isn't some other lowering specified.

The end result is probably going to be pretty inefficient unless your target has a special instruction to handle it (x86 has pmovmskb for i1 vector bitcasts, but otherwise you probably end up with some terrible lowering involving a lot of shifts).

This also reminded me of the following test case that is in trunk:  test/CodeGen/X86/pr20011.ll

%destTy = type { i2, i2 }
define void @crash(i64 %x0, i64 %y0, %destTy* nocapture %dest) nounwind {
; X64-LABEL: crash:
; X64:       # BB#0:
; X64-NEXT:    andl $3, %esi
; X64-NEXT:    movb %sil, (%rdx)
; X64-NEXT:    andl $3, %edi
; X64-NEXT:    movb %dil, (%rdx)
; X64-NEXT:    retq
  %x1 = trunc i64 %x0 to i2
  %y1 = trunc i64 %y0 to i2
  %1 = bitcast %destTy* %dest to <2 x i2>*
  %2 = insertelement <2 x i2> undef, i2 %x1, i32 0
  %3 = insertelement <2 x i2> %2, i2 %y1, i32 1
  store <2 x i2> %3, <2 x i2>* %1, align 1
  ret void
}

As you can see by the “X64” checks the behavior is quite weird.
Both movb instructions writes to the same address. So the result of the store <2 x i2> will be the same as when only storing one of the elements.
Is this really expected?

We have emailed Simon Pilgrim who added the test case to show that we no longer crash on this test case (see https://bugs.llvm.org/show_bug.cgi?id=20011). But even if the compiler doesn’t crash, the behavior seems wrong to me.

Yes, the behavior here is wrong.  DAGTypeLegalizer::SplitVecOp_STORE/DAGTypeLegalizer::SplitVecRes_LOAD/etc. assume the element size is a multiple of 8.  I'm sure this has been discussed before, but I guess nobody ever wrote a patch to fix it...?

-Eli

--
Employee of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project