[llvm-dev] Load combine pass
Sanjoy Das via llvm-dev
llvm-dev at lists.llvm.org
Thu Sep 29 11:01:59 PDT 2016
Hi Artur,
Artur Pilipenko wrote:
> BTW, do we really need to emit an atomic load if all the individual
> components are bytes?
Depends -- do you mean at the at the hardware level or at the IR
level?
If you mean at the IR level, then I think yes; since otherwise it is
legal to do transforms that break byte-wise atomicity in the IR, e.g.:
i32* ptr = ...
i32 val = *ptr
=> // Since no threads can be legally racing on *ptr
i32* ptr = ...
i32 val0 = *ptr
i32 val1 = *ptr
i32 val = (val0 & 1) | (val1 & ~1);
If you're talking about the hardware level, then I'm not sure; and my
guess is that the answer is almost certainly arch-dependent.
-- Sanjoy
>
> Artur
>> Given this, perhaps scheduling load widening after one pass of GVN/PRE
>> is fine?
>>
>> -- Sanjoy
>>
>>> I didn’t think of atomicity aspect though.
>>>
>>> Artur
>>>
>>>> On 28 Sep 2016, at 18:50, Philip Reames<listmail at philipreames.com>
wrote:
>>>>
>>>> There's a bit of additional context worth adding here...
>>>>
>>>> Up until very recently, we had a form of widening implemented in
GVN. We decided to remove this in https://reviews.llvm.org/D24096
precisely because its placement in the pass pipeline was inhibiting
other optimizations. There's also a major problem with doing widening at
the IR level which is that widening a pair of atomic loads into a single
wider atomic load can not be undone. This creates a major pass ordering
problem of its own.
>>>>
>>>> At this point, my general view is that widening transformations of
any kind should be done very late. Ideally, this is something the
backend would do, but doing it as a CGP like fixup pass over the IR is
also reasonable.
>>>>
>>>> With that in mind, I feel both the current placement of
LoadCombine (within the inliner iteration) and the proposed InstCombine
rule are undesirable.
>>>>
>>>> Philip
>>>>
>>>>
>>>> On 09/28/2016 08:22 AM, Artur Pilipenko wrote:
>>>>> Hi,
>>>>>
>>>>> I'm trying to optimize a pattern like this into a single i16 load:
>>>>> %1 = bitcast i16* %pData to i8*
>>>>> %2 = load i8, i8* %1, align 1
>>>>> %3 = zext i8 %2 to i16
>>>>> %4 = shl nuw i16 %3, 8
>>>>> %5 = getelementptr inbounds i8, i8* %1, i16 1
>>>>> %6 = load i8, i8* %5, align 1
>>>>> %7 = zext i8 %6 to i16
>>>>> %8 = shl nuw nsw i16 %7, 0
>>>>> %9 = or i16 %8, %4
>>>>>
>>>>> I came across load combine pass which is motivated by virtualliy
the same pattern. Load combine optimizes the pattern by combining
adjacent loads into one load and lets the rest of the optimizer cleanup
the rest. From what I see on the initial review for load combine
(https://reviews.llvm.org/D3580) it was not enabled by default because
it caused some performance regressions. It's not very surprising, I see
how this type of widening can obscure some facts for the rest of the
optimizer.
>>>>>
>>>>> I can't find any backstory for this pass, why was it chosen to
optimize the pattern in question in this way? What is the current status
of this pass?
>>>>>
>>>>> I have an alternative implementation for it locally. I
implemented an instcombine rule similar to recognise bswap/bitreverse
idiom. It relies on collectBitParts (Local.cpp) to determine the origin
of the bits in a given or value. If all the bits are happen to be loaded
from adjacent locations it replaces the or with a single load or a load
plus bswap.
>>>>>
>>>>> If the alternative approach sounds reasonable I'll post my
patches for review.
>>>>>
>>>>> Artur
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> llvm-dev at lists.llvm.org
>>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>
More information about the llvm-dev
mailing list