[PATCH] D36562: [Bitfield] Make the bitfield a separate location if it has width of legal integer type and its bit offset is naturally aligned for the type
Hal Finkel via llvm-commits
llvm-commits at lists.llvm.org
Sun Sep 3 22:40:47 PDT 2017
On 09/04/2017 12:12 AM, Xinliang David Li wrote:
>
>
> On Sun, Sep 3, 2017 at 9:23 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>
>
> On 09/03/2017 11:06 PM, Xinliang David Li wrote:
>> I think we can think about this in another way.
>>
>> For modern CPU architectures that support store forwarding with
>> store queues, it is generally not "safe" to blindly do local
>> optimizations that widen loads/stores
>
> Why not widen stores? Generally, the problem with store forwarding
> arises when the load is wider than the store (plus alignment issues).
>
>
> True, but probably with some caveats that are target dependent.
> Store widening also requires additional bit operations (and possibly
> an additional load), so it is tricky to model the overall benefit.
>
>> without sophisticated inter-procedural analysis. Doing so runs the
>> risk of greatly reducing the performance of some programs. Keeping a
>> naturally aligned load/store at its natural type is safer.
>>
>> Does it make sense?
>
> It makes sense. I could, however, say the same thing about
> inlining. We need to make inlining decisions locally, but they
> have global impact. Nevertheless, we need to make inlining
> decisions, and there's no practical way to make them in a truly
> non-local way.
>
>
> Speaking of inlining, we are actually thinking of ways to make the
> decisions more globally optimal, but that is off topic.
Neat.
>
> We also don't pessimize common cases to improve performance in
> rare cases. In the common case, reducing pressure on the memory
> units, and reducing the critical path, seem likely to me to be
> optimal. If that's not true, or doing otherwise has negligible
> cost (but can prevent rare downsides), we should certainly
> consider those options.
>
>
> Since we don't do load widening for non-bitfield cases (only for the
> very limited case of naturally aligned bitfields), it is hard to say
> we pessimize common cases for rare cases:
>
> 1) the upside of widening such accesses is not high, based on
> experience with another compiler (which does not do so);
> 2) there is an obvious downside in introducing additional extract
> instructions, which degrade performance;
> 3) there is an obvious downside of severely degraded performance when
> store forwarding is blocked.
I suspect that it's relatively rare to hit these store-to-load
forwarding issues compared to the number of times a system stores to or
loads from bitfields. In any case, I did some experiments on my Haswell
system and found that the version of Wei's benchmark in which the load
is split into two loads is 0.012% slower than the single-load version.
Indeed, I won't worry about that too much. On my P8, I couldn't measure
a difference. Obviously, this does somewhat miss the point, as the real
cost of this kind of thing comes from stressing the memory units in code
with a lot going on, not in isolated cases.

Nevertheless, I think that you've convinced me that this is a least-bad
solution. I'll want a flag preserving the old behavior. Something like
-fwiden-bitfield-accesses (modulo bikeshedding).
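For reference, here is how I read the rule in the patch title, applied
to a made-up struct (the field names and widths below are mine, not from
the patch, and "legal integer type" is target dependent; assume
i8/i16/i32/i64 as on x86-64):

  struct Example {
    unsigned long a : 8;   // width 8 is a legal integer type and bit
                           // offset 0 is a multiple of 8
                           //   -> treated as a separate i8 location
    unsigned long b : 16;  // width 16 is legal, but bit offset 8 is not
                           // a multiple of 16
                           //   -> stays in the shared storage unit
    unsigned long c : 3;   // width 3 is not a legal integer type
                           //   -> stays in the shared storage unit
    unsigned long d : 16;  // width 16 is legal, but bit offset 27 is not
                           // naturally aligned
                           //   -> stays in the shared storage unit
  };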
>
>
>
> And none of this answers the question of whether it's better to
> widen the store or to split the load and keep it narrow.
>
>
>
> It seems safer to do store widening more aggressively to avoid
> store-forwarding stalls, but doing so may also introduce other runtime
> overhead (an extra load, data merging, etc.).
Yes. Wei confirmed that this is slower.
Thanks again,
Hal
>
> Thanks,
>
> David
>
>
> Thanks again,
> Hal
>
>
>>
>> David
>>
>>
>>
>> On Sun, Sep 3, 2017 at 8:55 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>
>>
>> On 09/03/2017 10:38 PM, Xinliang David Li wrote:
>>> The cost of a store-forwarding stall is usually much higher than that
>>> of a load hitting the L1 cache. For instance, on Haswell the latter is
>>> ~4 cycles, while a store-forwarding stall costs about 10 cycles more
>>> than a successful store forward, i.e., roughly 15 cycles. In some
>>> scenarios, store-forwarding stalls can cost as much as 50 cycles. See
>>> Agner's documentation.
>>
>> I understand. As I see it, there are two potential ways to fix this
>> problem:
>>
>> 1. You can make the store wider (to match the size of the
>> wide load, thus permitting forwarding).
>> 2. You can make the load smaller (to match the size of the
>> small store, thus permitting forwarding).
>>
>> At least in this benchmark, which is a better solution?
>>
>> Thanks again,
>> Hal
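To make the two options concrete, here is roughly what each one means
for a bitfield store in one function and a load of the same field in
another, written out by hand in C++. This is only a sketch of the two
shapes of code, not actual compiler output; it mirrors the struct from
Wei's benchmark (quoted further down), assumes a little-endian x86-64
layout in which the struct occupies a single 8-byte unit, and the
function names are made up:

  #include <cstdint>
  #include <cstring>

  struct A { unsigned long f1 : 2, f2 : 6, f3 : 8, f4 : 4; };
  A a;
  unsigned long b;

  // Option 1: widen the store. The storing function read-modify-writes
  // the whole 8-byte unit, so a later 8-byte load of the unit can be
  // forwarded, at the price of an extra load plus merge operations here.
  void foo_widened_store() {
    uint64_t unit;
    std::memcpy(&unit, &a, sizeof unit);             // extra wide load
    unit = (unit & ~(0xffULL << 8)) | (3ULL << 8);   // merge f3 = 3 (bits 8..15)
    std::memcpy(&a, &unit, sizeof unit);             // one wide store
  }

  // Option 2: shrink the load. The loading function touches only the
  // byte that holds f3, matching a one-byte store of f3 elsewhere, at
  // the price of extra narrow loads when several fields are read together.
  void goo_shrunk_load() {
    uint8_t byte1;
    std::memcpy(&byte1, reinterpret_cast<const unsigned char *>(&a) + 1, 1);
    b = byte1;                                       // f3 is byte 1 here
  }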
>>
>>
>>>
>>> In other words, the optimizer needs to be taught to avoid
>>> defeating the HW pipeline feature as much as possible.
>>>
>>> David
>>>
>>> On Sun, Sep 3, 2017 at 6:32 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>>
>>> On 09/03/2017 03:44 PM, Wei Mi wrote:
>>>
>>> On Sat, Sep 2, 2017 at 6:04 PM, Hal Finkel <hfinkel at anl.gov> wrote:
>>>
>>> On 08/22/2017 10:56 PM, Wei Mi via llvm-commits wrote:
>>>
>>> On Tue, Aug 22, 2017 at 7:03 PM, Xinliang David Li
>>> <davidxl at google.com> wrote:
>>>
>>>
>>> On Tue, Aug 22, 2017 at 6:37 PM, Chandler Carruth via Phabricator
>>> <reviews at reviews.llvm.org> wrote:
>>>
>>> chandlerc added a comment.
>>>
>>> I'm really not a fan of the degree of complexity and subtlety that
>>> this introduces into the frontend, all to allow particular backend
>>> optimizations.
>>>
>>> I feel like this is Clang working around a fundamental deficiency in
>>> LLVM and we should instead find a way to fix this in LLVM itself.
>>>
>>> As has been pointed out before, user code can synthesize large
>>> integers that small bit sequences are extracted from, and Clang and
>>> LLVM should handle those just as well as actual bitfields.
>>>
>>> Can we see how far we can push the LLVM side before we add complexity
>>> to Clang here? I understand that there remain challenges to LLVM's
>>> stuff, but I don't think those challenges make *all* of the LLVM
>>> improvements off the table, I don't think we've exhausted all ways of
>>> improving the LLVM changes being proposed, and I think we should
>>> still land all of those and re-evaluate how important these issues
>>> are when all of that is in place.
>>>
>>>
>>> The main challenge of doing this in LLVM is that inter-procedural
>>> analysis (and possibly cross-module analysis) is needed (for store
>>> forwarding issues).
>>>
>>> Wei, perhaps you can provide a concrete test case to illustrate the
>>> issue so that reviewers have a good understanding.
>>>
>>> David
>>>
>>> Here is a runnable testcase:
>>> -------------------- 1.cc ------------------------
>>> class A {
>>> public:
>>>   unsigned long f1:2;
>>>   unsigned long f2:6;
>>>   unsigned long f3:8;
>>>   unsigned long f4:4;
>>> };
>>> A a;
>>> unsigned long b;
>>> unsigned long N = 1000000000;
>>>
>>> __attribute__((noinline))
>>> void foo() {
>>>   a.f3 = 3;
>>> }
>>>
>>> __attribute__((noinline))
>>> void goo() {
>>>   b = a.f3;
>>> }
>>>
>>> int main() {
>>>   unsigned long i;
>>>   for (i = 0; i < N; i++) {
>>>     foo();
>>>     goo();
>>>   }
>>> }
>>> ------------------------------------------------------------
>>> Now trunk takes about twice the running time of trunk + this patch.
>>> That is because trunk shrinks the store of a.f3 in foo (done by the
>>> DAGCombiner) but does not shrink the load of a.f3 in goo, so store
>>> forwarding will be blocked.
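For comparison, here is roughly the shape of what trunk generates for
the foo/goo pair above, written out by hand (a sketch only, assuming the
usual little-endian x86-64 layout; the helper names are made up, and
this is not actual compiler output):

  #include <cstdint>
  #include <cstring>

  struct A { unsigned long f1 : 2, f2 : 6, f3 : 8, f4 : 4; };
  A a;
  unsigned long b;

  void foo_as_compiled() {
    // The DAGCombiner shrinks "a.f3 = 3" to a one-byte store.
    reinterpret_cast<unsigned char *>(&a)[1] = 3;
  }

  void goo_as_compiled() {
    // The load of a.f3 is not shrunk: the whole 8-byte unit is loaded
    // and f3 is extracted.  The 8-byte load overlaps, but is wider than,
    // the one-byte store above, so it cannot be forwarded from the store
    // queue and stalls.
    uint64_t unit;
    std::memcpy(&unit, &a, sizeof unit);
    b = (unit >> 8) & 0xff;
  }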
>>>
>>>
>>> I can confirm that this is true on Haswell and also on a POWER8.
>>> Interestingly, on a POWER7, the performance is the same either way (on
>>> the good side). I ran the test case both as presented and with f3
>>> replaced by a non-bitfield unsigned char member. Thinking that the
>>> POWER7 result might be because it's big-endian, I flipped the order of
>>> the fields, and found that the version where f3 is not a bitfield is
>>> faster than otherwise, but only by 12.5%.
>>>
>>> Why, in this case, don't we shrink the load? It
>>> seems like we should (and it
>>> seems like a straightforward case).
>>>
>>> Thanks again,
>>> Hal
>>>
>>> Hal, thanks for trying the test.
>>>
>>> Yes, it is straightforward to shrink the load in the test. I can
>>> change the testcase a little to show why it is sometimes difficult to
>>> shrink the load:
>>>
>>> class A {
>>> public:
>>>   unsigned long f1:16;
>>>   unsigned long f2:16;
>>>   unsigned long f3:16;
>>>   unsigned long f4:8;
>>> };
>>> A a;
>>> bool b;
>>> unsigned long N = 1000000000;
>>>
>>> __attribute__((noinline))
>>> void foo() {
>>>   a.f4 = 3;
>>> }
>>>
>>> __attribute__((noinline))
>>> void goo() {
>>>   b = (a.f4 == 0 && a.f3 == (unsigned short)-1);
>>> }
>>>
>>> int main() {
>>>   unsigned long i;
>>>   for (i = 0; i < N; i++) {
>>>     foo();
>>>     goo();
>>>   }
>>> }
>>>
>>> For the load of a.f4 in goo, it is difficult to motivate shrinking it
>>> after instcombine, because the comparison with a.f3 and the comparison
>>> with a.f4 are merged:
>>>
>>> define void @_Z3goov() local_unnamed_addr #0 {
>>>   %1 = load i64, i64* bitcast (%class.A* @a to i64*), align 8
>>>   %2 = and i64 %1, 0xffffff00000000
>>>   %3 = icmp eq i64 %2, 0xffff00000000
>>>   %4 = zext i1 %3 to i8
>>>   store i8 %4, i8* @b, align 1, !tbaa !2
>>>   ret void
>>> }
>>>
>>>
>>> Exactly. But this behavior is desirable, locally.
>>> There's no good answer here: We either generate extra
>>> load traffic here (because we need to load the fields
>>> separately), or we widen the store (which generates
>>> extra load traffic there). Do you know, in terms of
>>> performance, which is better in this case (i.e., is it
>>> better to widen the store or split the load)?
>>>
>>> -Hal
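To illustrate the trade-off for this second testcase: shrinking the load
in goo() would mean replacing the single 8-byte load (and the merged
24-bit mask-and-compare that instcombine produces) with two narrow loads
and two compares. Hand-written, that looks roughly like the following (a
sketch only, assuming the usual little-endian x86-64 layout in which f3
occupies bytes 4-5 and f4 occupies byte 6 of the 8-byte unit; the
function name is made up):

  #include <cstdint>
  #include <cstring>

  struct A { unsigned long f1 : 16, f2 : 16, f3 : 16, f4 : 8; };
  A a;
  bool b;

  void goo_split_loads() {
    uint16_t f3;
    uint8_t f4;
    std::memcpy(&f3, reinterpret_cast<const unsigned char *>(&a) + 4, sizeof f3);
    std::memcpy(&f4, reinterpret_cast<const unsigned char *>(&a) + 6, sizeof f4);
    // Two narrow loads: the one-byte load of f4 can now be forwarded from
    // the narrow store in foo(), but goo() issues an extra load and the
    // merged compare is split in two.
    b = (f4 == 0 && f3 == 0xffff);
  }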
>>>
>>>
>>>
>>> Thanks,
>>> Wei.
>>>
>>> The testcases show the potential problem of store shrinking. Before we
>>> decide to shrink a store, we need to know that all the related loads
>>> will be shrunk, and that requires IPA analysis. Otherwise, when load
>>> shrinking is blocked for some difficult case (like the instcombine case
>>> described in
>>> https://www.mail-archive.com/cfe-commits@lists.llvm.org/msg65085.html),
>>> a performance regression will happen.
>>>
>>> Wei.
>>>
>>>
>>>
>>> Repository:
>>> rL LLVM
>>>
>>> https://reviews.llvm.org/D36562
>>>
>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-commits
>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>> --
>>> Hal Finkel
>>> Lead, Compiler Technology and Programming Languages
>>> Leadership Computing Facility
>>> Argonne National Laboratory
>>>
>>>
>>
>> --
>> Hal Finkel
>> Lead, Compiler Technology and Programming Languages
>> Leadership Computing Facility
>> Argonne National Laboratory
>>
>>
>
> --
> Hal Finkel
> Lead, Compiler Technology and Programming Languages
> Leadership Computing Facility
> Argonne National Laboratory
>
>
--
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory