[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type

Hal Finkel via cfe-commits cfe-commits at lists.llvm.org
Sun Sep 3 22:40:47 PDT 2017


On 09/04/2017 12:12 AM, Xinliang David Li wrote:
>
>
> On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
>     On 09/03/2017 11:06 PM, Xinliang David Li wrote:
>>     I think we can look at this another way.
>>
>>     For modern CPU architectures that support store forwarding with
>>     store queues, it is generally not "safe" to blindly apply local
>>     optimizations that widen loads/stores
>
>     Why not widen stores? Generally the problem with store forwarding
>     is where the load is wider than the store (plus alignment issues).
>
>
> True, but probably with some caveats which are target dependent.
> Store widening also requires additional bit operations (and possibly
> an additional load), so it is tricky to model the overall benefit.
>
>>     without sophisticated inter-procedural analysis. Doing so runs
>>     the risk of greatly reducing the performance of some programs.
>>     Keeping a naturally aligned load/store at its natural type is safer.
>>
>>     Does it make sense?
>
>     It makes sense. I could, however, say the same thing about
>     inlining. We need to make inlining decisions locally, but they
>     have global impact. Nevertheless, we need to make inlining
>     decisions, and there's no practical way to make them in a truly
>     non-local way.
>
>
> Speaking of inlining, we are actually thinking of ways to make the 
> decisions more globally optimal, but that is off topic.

Neat.

>
>     We also don't pessimize common cases to improve performance in
>     rare cases. In the common case, reducing pressure on the memory
>     units and reducing the critical path seem likely to me to be
>     optimal. If that's not true, or if doing otherwise has negligible
>     cost (but can prevent rare downsides), we should certainly
>     consider those options.
>
>
> Since we don't do load widening for non-bitfield cases (only for the
> very limited case of naturally aligned bitfields), it is hard to say
> we pessimize common cases for rare ones:
>
> 1) the upside of widening such accesses is not high, judging from
> experience with another compiler (which does not do so)
> 2) there is an obvious downside of introducing additional extract
> instructions, which degrades performance
> 3) there is an obvious downside of severely degraded performance when
> store forwarding is blocked.

I suspect that it's relatively rare to hit these store-to-load 
forwarding issues compared to the number of times the system stores to 
or loads from bitfields. In any case, I did some experiments on my 
Haswell system and found that the load from Wei's benchmark which is 
split into two loads, compared to the single-load version, is 0.012% 
slower. I, indeed, won't worry about that too much. On my POWER8, I 
couldn't measure a difference. Obviously, this does somewhat miss the 
point, as the real cost of this kind of thing comes from stressing the 
memory units in code with a lot going on, not in isolated cases.
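
To make "split into two loads" concrete: below is a hand-written sketch
of the two lowerings of goo() from Wei's second test case (quoted later
in this thread), assuming the usual x86-64 little-endian layout where
f3 occupies bytes 4-5 of the word and f4 occupies byte 6. The names
(storage, goo_wide, goo_split) are illustrative, not compiler output:

  #include <cstdint>
  #include <cstring>

  unsigned char storage[8];  // stand-in for the bytes of the global 'a'

  // One wide load: what instcombine produces after merging the two
  // field comparisons into a single mask-and-compare.
  bool goo_wide() {
    std::uint64_t v;
    std::memcpy(&v, storage, 8);  // one 8-byte load
    return (v & 0x00ffffff00000000ULL) == 0x0000ffff00000000ULL;
  }

  // Two narrow loads: what load shrinking would produce; each one can
  // be forwarded from a preceding narrow store to the same field.
  bool goo_split() {
    std::uint16_t f3;
    std::uint8_t f4;
    std::memcpy(&f3, storage + 4, 2);  // bytes 4-5: f3
    std::memcpy(&f4, storage + 6, 1);  // byte 6: f4
    return f4 == 0 && f3 == 0xffff;
  }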

Nevertheless, I think that you've convinced me that this is the 
least-bad solution. I'll want a flag preserving the old behavior. 
Something like -fwiden-bitfield-accesses (modulo bikeshedding).
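
Usage would presumably look like (name provisional, per the above):

  clang -O2 -fwiden-bitfield-accesses 1.cc   # opt back into the old, widened accesses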

>
>
>
>     And none of this answers the question of whether it's better to
>     have the store wider or the load split and narrow.
>
>
>
> It seems safer to do store widening more aggressively to avoid the
> store-forwarding stall issue, but doing so aggressively may also
> introduce other runtime overhead (an extra load, data merging, etc.).

Yes. Wei confirmed that this is slower.
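
For reference, the widened store amounts to a read-modify-write of the
whole word. A hand-written sketch for 'a.f3 = 3;' from the first test
case, assuming the x86-64 layout where f3 occupies bits 8-15 (names
are illustrative):

  #include <cstdint>
  #include <cstring>

  unsigned char storage[8];  // stand-in for the bytes of the global 'a'

  void foo_wide() {
    std::uint64_t v;
    std::memcpy(&v, storage, 8);        // the extra load David mentions
    v = (v & ~0x000000000000ff00ULL)    // clear the old f3 bits
        | (std::uint64_t{3} << 8);      // merge in the new value
    std::memcpy(storage, &v, 8);        // one full-width store
  }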

Thanks again,
Hal

>
> Thanks,
>
> David
>
>
>     Thanks again,
>     Hal
>
>
>>
>>     David
>>
>>
>>
>>     On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>>         On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>>         A store-forwarding stall usually costs much more than a
>>>         load hitting the L1 cache. For instance, on Haswell, the
>>>         latter is ~4 cycles, while a store-forwarding stall costs
>>>         about 10 cycles more than a successful store forwarding,
>>>         i.e., roughly 15 cycles in total. In some scenarios, the
>>>         stall can be as high as 50 cycles. See Agner Fog's
>>>         documentation.
>>
>>         I understand. As I see it, there are two potential
>>         ways to fix this problem:
>>
>>          1. You can make the store wider (to match the size of the
>>         wide load, thus permitting forwarding).
>>          2. You can make the load smaller (to match the size of the
>>         small store, thus permitting forwarding).
>>
>>         At least in this benchmark, which is a better solution?
>>
>>         Thanks again,
>>         Hal
>>
>>
>>>
>>>         In other words, the optimizer needs to be taught to avoid
>>>         defeating this HW pipeline feature as much as possible.
>>>
>>>         David
>>>
>>>         On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>
>>>             On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>
>>>                 On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel
>>>                 <hfinkel at anl.gov> wrote:
>>>
>>>                     On 08/22/2017 10:56 PM, Wei Mi via llvm-commits
>>>                     wrote:
>>>
>>>                         On Tue, Aug 22, 2017 at 7:03 PM, Xinliang
>>>                         David Li <davidxl at google.com> wrote:
>>>
>>>
>>>                             On Tue, Aug 22, 2017 at 6:37 PM,
>>>                             Chandler Carruth via Phabricator
>>>                             <reviews at reviews.llvm.org> wrote:
>>>
>>>                                 chandlerc added a comment.
>>>
>>>                                 I'm really not a fan of the degree
>>>                                 of complexity and subtlety that this
>>>                                 introduces into the frontend, all to
>>>                                 allow particular backend
>>>                                 optimizations.
>>>
>>>                                 I feel like this is Clang working
>>>                                 around a fundamental deficiency in
>>>                                 LLVM
>>>                                 and we should instead find a way to
>>>                                 fix this in LLVM itself.
>>>
>>>                                 As has been pointed out before, user
>>>                                 code can synthesize large integers
>>>                                 that small bit sequences are
>>>                                 extracted from, and Clang and LLVM
>>>                                 should
>>>                                 handle those just as well as actual
>>>                                 bitfields.
>>>
>>>                                 Can we see how far we can push the
>>>                                 LLVM side before we add complexity to
>>>                                 Clang here? I understand that there
>>>                                 remain challenges on the LLVM side,
>>>                                 but I don't think those challenges
>>>                                 take *all* of the LLVM improvements
>>>                                 off the table, I don't think we've
>>>                                 exhausted all ways of improving the
>>>                                 LLVM changes being proposed, and I
>>>                                 think we should still land all of
>>>                                 those and re-evaluate how important
>>>                                 these issues are when all of that is
>>>                                 in place.
>>>
>>>
>>>                             The main challenge of doing this in
>>>                             LLVM is that inter-procedural analysis
>>>                             (and possibly cross-module analysis) is
>>>                             needed (for store-forwarding issues).
>>>
>>>                             Wei, perhaps you can provide a concrete
>>>                             test case to illustrate the issue so
>>>                             that reviewers have a good understanding.
>>>
>>>                             David
>>>
>>>                         Here is a runnable test case:
>>>                         -------------------- 1.cc
>>>                         ------------------------
>>>                         class A {
>>>                         public:
>>>                             unsigned long f1:2;
>>>                             unsigned long f2:6;
>>>                             unsigned long f3:8;
>>>                             unsigned long f4:4;
>>>                         };
>>>                         A a;
>>>                         unsigned long b;
>>>                         unsigned long N = 1000000000;
>>>
>>>                         __attribute__((noinline))
>>>                         void foo() {
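>>>                             // trunk shrinks this to a narrow
>>>                             // (one-byte) store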
>>>                             a.f3 = 3;
>>>                         }
>>>
>>>                         __attribute__((noinline))
>>>                         void goo() {
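>>>                             // trunk does not shrink this wide load,
>>>                             // so forwarding from foo's narrow store
>>>                             // is blocked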
>>>                             b = a.f3;
>>>                         }
>>>
>>>                         int main() {
>>>                             unsigned long i;
>>>                             for (i = 0; i < N; i++) {
>>>                               foo();
>>>                               goo();
>>>                             }
>>>                         }
>>>                         ------------------------------------------------------------
>>>                         Now trunk takes about twice the running
>>>                         time compared with trunk + this patch. That
>>>                         is because trunk shrinks the store of a.f3
>>>                         in foo (done by the DAGCombiner) but does
>>>                         not shrink the load of a.f3 in goo, so store
>>>                         forwarding is blocked.
>>>
>>>
>>>                     I can confirm that this is true on Haswell and
>>>                     also on a POWER8.
>>>                     Interestingly, on a POWER7, the performance is
>>>                     the same either way (on the
>>>                     good side). I ran the test case as presented and
>>>                     where I replaced f3 with a
>>>                     non-bitfield unsigned char member. Thinking that
>>>                     the POWER7 result might be
>>>                     because it's big-endian, I flipped the order of
>>>                     the fields, and found that
>>>                     the version where f3 is not a bitfield is faster
>>>                     than otherwise, but only by
>>>                     12.5%.
>>>
>>>                     Why, in this case, don't we shrink the load? It
>>>                     seems like we should (and it
>>>                     seems like a straightforward case).
>>>
>>>                     Thanks again,
>>>                     Hal
>>>
>>>                 Hal, thanks for trying the test.
>>>
>>>                 Yes, it is straightforward to shrink the load in
>>>                 the test. I can
>>>                 change the test case a little to show why it is
>>>                 sometimes difficult to
>>>                 shrink the load:
>>>
>>>                 class A {
>>>                 public:
>>>                    unsigned long f1:16;
>>>                    unsigned long f2:16;
>>>                    unsigned long f3:16;
>>>                    unsigned long f4:8;
>>>                 };
>>>                 A a;
>>>                 bool b;
>>>                 unsigned long N = 1000000000;
>>>
>>>                 __attribute__((noinline))
>>>                 void foo() {
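>>>                    // the store is again shrunk to a narrow
>>>                    // store, as in the first test case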
>>>                    a.f4 = 3;
>>>                 }
>>>
>>>                 __attribute__((noinline))
>>>                 void goo() {
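>>>                    // instcombine merges the two field tests into
>>>                    // one wide load + mask; see the IR below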
>>>                    b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>>>                 }
>>>
>>>                 int main() {
>>>                    unsigned long i;
>>>                    for (i = 0; i < N; i++) {
>>>                      foo();
>>>                      goo();
>>>                    }
>>>                 }
>>>
>>>                 For the load of a.f4 in goo, it is difficult to
>>>                 motivate shrinking it after
>>>                 instcombine, because the comparison with a.f3 and
>>>                 the comparison with
>>>                 a.f4 have been merged:
>>>
>>>                 define void @_Z3goov() local_unnamed_addr #0 {
>>>                    %1 = load i64, i64* bitcast (%class.A* @a to
>>>                 i64*), align 8
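>>>                    ; the mask keeps f3 (bits 32-47) and f4 (bits 48-55)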
>>>                    %2 = and i64 %1, 0xffffff00000000
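>>>                    ; true iff f4 == 0 and f3 == 0xffff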
>>>                    %3 = icmp eq i64 %2, 0xffff00000000
>>>                    %4 = zext i1 %3 to i8
>>>                    store i8 %4, i8* @b, align 1, !tbaa !2
>>>                    ret void
>>>                 }
>>>
>>>
>>>             Exactly. But this behavior is desirable, locally.
>>>             There's no good answer here: We either generate extra
>>>             load traffic here (because we need to load the fields
>>>             separately), or we widen the store (which generates
>>>             extra load traffic there). Do you know, in terms of
>>>             performance, which is better in this case (i.e., is it
>>>             better to widen the store or split the load)?
>>>
>>>              -Hal
>>>
>>>
>>>
>>>                 Thanks,
>>>                 Wei.
>>>
>>>                         The test case shows the potential problem
>>>                         of store shrinking. Before we decide to do
>>>                         store shrinking, we need to know that all
>>>                         the related loads will be shrunk, and that
>>>                         requires IPA. Otherwise, when load shrinking
>>>                         is blocked for some difficult case (like the
>>>                         instcombine case described in
>>>                         https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>>>                         a performance regression will happen.
>>>
>>>                         Wei.
>>>
>>>
>>>
>>>                                 Repository:
>>>                                     rL LLVM
>>>
>>>                                 https://reviews.llvm.org/D36562
>>>
>>
>

-- 
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
