[llvm] r324835 - [X86] Reduce Store Forward Block issues in HW
Saba, Lama via llvm-commits
llvm-commits at lists.llvm.org
Tue Feb 13 05:11:01 PST 2018
This is actually on purpose :) the 2 wide moves are causing "Store Forward Block" issue in this case, so they are costing extra ~13cycles.. more details can be found in the commit's description.
-----Original Message-----
From: hwennborg at google.com [mailto:hwennborg at google.com] On Behalf Of Hans Wennborg
Sent: Tuesday, February 13, 2018 3:02 PM
To: Saba, Lama <lama.saba at intel.com>
Cc: mboehme at google.com; llvm-commits <llvm-commits at lists.llvm.org>
Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block issues in HW
Great!
It also looks like the new code is less efficient, with 8 moves instead of two wide ones. Is this also part of the AA problem, or is it something else?
On Tue, Feb 13, 2018 at 1:52 PM, Saba, Lama <lama.saba at intel.com> wrote:
> Hi,
>
> Thank you, I found the issue, some of the AA info (mem offsets) is not correctly updated when building the new copies, therefore the compiler allows himself to move some of the instructions up and overwriting the correct value.
> (the first 4 added instructions should be after movw $100, 20(%rsp) )
> I am working on a fix.
>
> -----Original Message-----
> From: hwennborg at google.com [mailto:hwennborg at google.com] On Behalf Of
> Hans Wennborg
> Sent: Tuesday, February 13, 2018 2:44 PM
> To: Saba, Lama <lama.saba at intel.com>
> Cc: mboehme at google.com; llvm-commits <llvm-commits at lists.llvm.org>
> Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block issues
> in HW
>
> Attaching preprocessed source and a creduced version of that.
>
> I think maybe the creduced version was reduced to aggressively, the final "printf(&p)" looks very fishy.
>
> But looking at the asm difference from the full preprocessed might be good enough. Here's the diff before and after r324835.
>
> (I'm not entirely sure what's happening, but it looks like the new
> code fails to copy the "d\0" part of the string, $100.)
>
> --- /tmp/good 2018-02-13 13:35:15.230627030 +0100
> +++ /tmp/bad 2018-02-13 13:35:10.194682062 +0100
> @@ -1,47 +1,53 @@
> .text
> .file "a.cc"
> .globl main # -- Begin function main
> .p2align 4, 0x90
> .type main, at function
> main: # @main
> .cfi_startproc
> # %bb.0: #
> %_ZN9__gnu_cxx17__sso_string_baseIcSt11char_traitsIcESaIcEED2Ev.exit
> pushq %rbx
> .cfi_def_cfa_offset 16
> subq $64, %rsp
> .cfi_def_cfa_offset 80
> .cfi_offset %rbx, -16
> leaq 48(%rsp), %rbx
> movq %rbx, 32(%rsp)
> leaq 16(%rsp), %rax
> movq %rax, (%rsp)
> movl $1819438967, 16(%rsp) # imm = 0x6C726F77
> + movl 16(%rsp), %eax
> + movl %eax, 48(%rsp)
> + movzwl 20(%rsp), %eax
> + movw %ax, 52(%rsp)
> movw $100, 20(%rsp)
> - movups 16(%rsp), %xmm0
> - movups %xmm0, 48(%rsp)
> + movq 22(%rsp), %rax
> + movq %rax, 54(%rsp)
> + movzwl 30(%rsp), %eax
> + movw %ax, 62(%rsp)
> movq $5, 40(%rsp)
> movq %rbx, %rdi
> callq puts
> movq 32(%rsp), %rdi
> cmpq %rbx, %rdi
> je .LBB0_2
> # %bb.1: # %if.then.i.i.i
> callq _ZdlPv
> .LBB0_2: #
> %_ZN21ExplicitlyConstructedIN9__gnu_cxx14__versa_stringIcSt11char_traitsIcESaIcENS0_17__sso_string_baseEEEE8DestructEv.exit
> xorl %eax, %eax
> addq $64, %rsp
> popq %rbx
> retq
> .Lfunc_end0:
> .size main, .Lfunc_end0-main
> .cfi_endproc
> # -- End function
> .type .L.str, at object # @.str
> .section .rodata.str1.1,"aMS", at progbits,1
> .L.str:
> .asciz "world"
> .size .L.str, 6
>
>
> - .ident "clang version 7.0.0 (trunk 324828) (llvm/trunk 324834)"
> + .ident "clang version 7.0.0 (trunk 324828) (llvm/trunk 324835)"
> .section ".note.GNU-stack","", at progbits
>
> On Tue, Feb 13, 2018 at 12:20 PM, Saba, Lama <lama.saba at intel.com> wrote:
>> Thanks! I will check it
>>
>> -----Original Message-----
>> From: hwennborg at google.com [mailto:hwennborg at google.com] On Behalf Of
>> Hans Wennborg
>> Sent: Tuesday, February 13, 2018 12:03 PM
>> To: Saba, Lama <lama.saba at intel.com>
>> Cc: mboehme at google.com; llvm-commits <llvm-commits at lists.llvm.org>
>> Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block issues
>> in HW
>>
>> Martin was able to reduce his repro to a stand-alone C++ program, and I managed to get it working outside our internal environment.
>>
>> a.cc:
>>
>> #include <cstdint>
>> #include <iostream>
>> #include "/usr/include/c++/6/ext/vstring.h"
>>
>> using int64 = std::int64_t;
>>
>> template <typename T>
>> class ExplicitlyConstructed {
>> public:
>> void DefaultConstruct() {
>> new (&union_) T();
>> }
>>
>> void Destruct() {
>> get_mutable()->~T();
>> }
>>
>> constexpr const T& get() const { return reinterpret_cast<const T&>(union_); }
>> T* get_mutable() { return reinterpret_cast<T*>(&union_); }
>>
>> private:
>> // Prefer c++14 aligned_storage, but for compatibility this will do.
>> union AlignedUnion {
>> char space[sizeof(T)];
>> int64 align_to_int64;
>> void* align_to_ptr;
>> } union_;
>> };
>>
>> int main() {
>> ExplicitlyConstructed<__gnu_cxx::__versa_string<char>> str;
>> str.DefaultConstruct();
>> *str.get_mutable() = __gnu_cxx::__versa_string<char>("world", 5);
>> if (str.get() != "world") {
>> std::cout << "Expected 'world' but got '" << str.get() << "'\n";
>> }
>> str.Destruct();
>>
>> return 0;
>> }
>>
>> bin/clang++ -O2 /tmp/a.cc && ./a.out
>> Expected 'world' but got 'worl'
>>
>>
>> I'll reduce further to get rid off the library dependencies.
>>
>>
>>
>> On Mon, Feb 12, 2018 at 5:58 PM, Saba, Lama <lama.saba at intel.com> wrote:
>>> Hmm.. I'm not too familiar with protobufs, is there a way I can take
>>> a look at the generated assembly or ll?
>>>
>>> I see that the last 4 bytes are not getting copied, so yeah it's
>>> probably the memcpy at initialization/return but my attempts to
>>> reproduce did not work..
>>>
>>>
>>>
>>>
>>>
>>> From: Martin Böhme [mailto:mboehme at google.com]
>>> Sent: Monday, February 12, 2018 4:56 PM
>>> To: Saba, Lama <lama.saba at intel.com>
>>> Cc: hans at chromium.org
>>> Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block
>>> issues in HW
>>>
>>>
>>>
>>> I've got an internal repro that essentially looks like this:
>>>
>>>
>>>
>>> proto file:
>>>
>>>
>>>
>>> message MyProto {
>>>
>>> optional bytes foo = 1 [default = "world"];
>>>
>>> }
>>>
>>>
>>>
>>> test file:
>>>
>>>
>>>
>>> TEST(ProtoReproTest, Repro) {
>>>
>>> MyProto proto;
>>>
>>> EXPECT_EQ("world", proto.foo());
>>>
>>> }
>>>
>>>
>>>
>>> This fails with an actual value for proto.foo() of "worl\0". I
>>> assume it's probably not a coincidence that four bytes are getting copied correctly.
>>> Maybe this already gives you a theory to start with?
>>>
>>>
>>>
>>> I'm still working to get this building in a way that reproduces externally.
>>> My hope is that I'll be able to get a repro that doesn't even use
>>> protos -- it looks to me as if the error is already happening when
>>> the string gets initialized.
>>>
>>>
>>>
>>>
>>>
>>> On Mon, 12 Feb 2018 at 15:42, Saba, Lama <lama.saba at intel.com> wrote:
>>>
>>> Okay thanks.
>>>
>>> -----Original Message-----
>>> From: hwennborg at google.com [mailto:hwennborg at google.com] On Behalf
>>> Of Hans Wennborg
>>> Sent: Monday, February 12, 2018 4:39 PM
>>> To: Saba, Lama <lama.saba at intel.com>; Martin Böhme
>>> <mboehme at google.com>
>>> Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block
>>> issues in HW
>>>
>>> I don't have anything small, it would require building a part of Chromium.
>>>
>>> Martin also saw miscompiles internally, concerning initialization of
>>> protobuf fields. I think he's working on a repro.
>>>
>>> On Mon, Feb 12, 2018 at 3:36 PM, Saba, Lama <lama.saba at intel.com> wrote:
>>>> Hi,
>>>>
>>>> Thanks for notifying, I am fixing the assertion problem, regarding
>>>> the other failures:
>>>> In some of Chromium's 64-bit builds, we're seeing test failures
>>>> that started around the same time frame as this change, for example:
>>>> https://ci.chromium.org/buildbot/chromium.clang/ToTWin64/920
>>>> So besides the assert, I suspect this is causing miscompiles.
>>>>
>>>> Is there a way I can get a reproducer for this?
>>>>
>>>> Thanks,
>>>> Lama
>>>>
>>>> -----Original Message-----
>>>> From: hwennborg at google.com [mailto:hwennborg at google.com] On Behalf
>>>> Of Hans Wennborg
>>>> Sent: Monday, February 12, 2018 2:47 PM
>>>> To: Saba, Lama <lama.saba at intel.com>
>>>> Cc: llvm-commits <llvm-commits at lists.llvm.org>
>>>> Subject: Re: [llvm] r324835 - [X86] Reduce Store Forward Block
>>>> issues in HW
>>>>
>>>> I've reverted this in r324887 as it broke the Chromium build, see
>>>> https://bugs.llvm.org/show_bug.cgi?id=36346
>>>>
>>>> On Sun, Feb 11, 2018 at 10:34 AM, Lama Saba via llvm-commits
>>>> <llvm-commits at lists.llvm.org> wrote:
>>>>> Author: lsaba
>>>>> Date: Sun Feb 11 01:34:12 2018
>>>>> New Revision: 324835
>>>>>
>>>>> URL: http://llvm.org/viewvc/llvm-project?rev=324835&view=rev
>>>>> Log:
>>>>> [X86] Reduce Store Forward Block issues in HW
>>>>>
>>>>> If a load follows a store and reloads data that the store has
>>>>> written to memory, Intel microarchitectures can in many cases
>>>>> forward the data directly from the store to the load, This "store
>>>>> forwarding" saves cycles by enabling the load to directly obtain
>>>>> the data instead of accessing the data from cache or memory.
>>>>> A "store forward block" occurs in cases that a store cannot be
>>>>> forwarded to the load. The most typical case of store forward
>>>>> block on Intel Core microarchiticutre that a small store cannot be forwarded to a large load.
>>>>> The estimated penalty for a store forward block is ~13 cycles.
>>>>>
>>>>> This pass tries to recognize and handle cases where "store forward block"
>>>>> is created by the compiler when lowering memcpy calls to a
>>>>> sequence of a load and a store.
>>>>>
>>>>> The pass currently only handles cases where memcpy is lowered to
>>>>> XMM/YMM registers, it tries to break the memcpy into smaller copies.
>>>>> breaking the memcpy should be possible since there is no atomicity
>>>>> guarantee for loads and stores to XMM/YMM.
>>>>>
>>>>> Change-Id: I620b6dc91583ad9a1444591e3ddc00dd25d81748
>>>>>
>>>>> Added:
>>>>> llvm/trunk/lib/Target/X86/X86FixupSFB.cpp
>>>>> llvm/trunk/test/CodeGen/X86/fixup-sfb.ll
>>>>> Modified:
>>>>> llvm/trunk/lib/Target/X86/CMakeLists.txt
>>>>> llvm/trunk/lib/Target/X86/X86.h
>>>>> llvm/trunk/lib/Target/X86/X86TargetMachine.cpp
>> ---------------------------------------------------------------------
>> Intel Israel (74) Limited
>>
>> This e-mail and any attachments may contain confidential material for
>> the sole use of the intended recipient(s). Any review or distribution
>> by others is strictly prohibited. If you are not the intended
>> recipient, please contact the sender and delete all copies.
> ---------------------------------------------------------------------
> Intel Israel (74) Limited
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
---------------------------------------------------------------------
Intel Israel (74) Limited
This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.
More information about the llvm-commits
mailing list