[PATCH] Move memset value into register

Tue Sep 17 23:39:21 PDT 2013

On Tue, Sep 17, 2013 at 11:23 PM, Serge Pavlov <sepavloff at gmail.com> wrote:

> Thank you for sharing thoughts, Sean.
>
> I wonder if any target other than x86 can have similar problem? Should
> this MI-level peephole pass be x86 specific?
>

I would start specific and generalize incrementally as we have concrete
test cases that benefit from it.

So, start with an x86-specific pass, specifically targetting patterns of
repeated immediate operands that can be replaced with a register operand.
This should have nothing to do with 'mov' instructions specifically,
instead focusing on immediate operands where encoding a register operand is
smaller. There should be a pretty reasonable set of heuristics here that
catch your memset case as well as cases like a series of "add -2"
instructions that sometimes come up in refcounting code.

We can look at more generalizations once this works well. =]

> Thanks,
> --Serge
>
>
> 2013/9/18 Sean Silva <silvas at purdue.edu>
>
>> On Wed, Sep 18, 2013 at 12:17 AM, Alex Rosenberg <alexr at leftfield.org>wrote:
>>
>>> FWIW, Sean ran into this exact issue earlier in the Summer. On x86,
>>> these long instructions with 32-bit immediate zeros in them added up to a
>>> substantial amount of code. IIRC, he noticed lots of those in global
>>> constructors.
>>>
>>> Sean?
>>>
>>>
>> The issue I saw this Summer was a regression of our internal branch vs.
>> trunk (i.e., trunk didn't have the code size issue). The issue appeared to
>> be fixed in newer versions of the toolchain repo that had merged up to LLVM
>> 3.3.
>>
>> I agree with Chandler though that this would best be done as an MI-level
>> peephole pass. Besides the register pressure thing, if optimizing for size,
>> there are some pretty "low-level" code size hazards that are basically due
>> to quirks in the instruction encoding, Off the top of my head I can think
>> of:
>> * once an offset from a base pointer becomes large enough (doesn't fit in
>> a signed byte), the immediate is encoded as 32-bit instead. Each time you
>> can avoid this you save 3 bytes of code size basically.
>> * Some AVX2 instructions interpret these immediates differently (as
>> multiples of the logical vector element size), which increases the actual
>> range of an imm8 operand, so the simple rule "keep the offset from the base
>> pointer so that it fits in an imm8" isn't quite so simple.
>> * some registers need larger encodings for the indexing (r8-r15 need
>> 1-byte REX (but that may already be required for a 64-bit operation);
>> rsp+imm8 needs a SIB, since the "usual" encoding for that is special-cased
>> for use for RIP-rel addressing). These usually will save you only 1 byte.
>>
>> -- Sean Silva
>>
>>
>>
>>> Alex
>>>
>>> On Sep 16, 2013, at 9:20 PM, Nadav Rotem <nrotem at apple.com> wrote:
>>>
>>> Hi,
>>>
>>> I am sorry for missing this patch the last time you sent it.  Do you
>>> have any performance numbers that show the profitability of this
>>> transformation ?  I also feel like SelectionDAG is not the right place to
>>> implement this kind of optimization. I think that introducing virtual
>>> registers would inhibit other optimizations on the DAG.  I also agree that
>>> this optimization can be generalized to any sequence of machine
>>> instructions that repeatedly use the same constant.
>>>
>>> I have a few comments on the patch itself.
>>>
>>> You can commit the documentation of this method even without the change.
>>>
>>> +/// \brief Lower the call to 'memset' intrinsic function into a series
>>> of store
>>> +/// operations.
>>> +///
>>> +/// \param DAG Selection DAG where lowered code is placed.
>>> +/// \param dl Link to corresponding IR location.
>>>
>>> ...
>>>
>>> +/// The function tries to replace 'llvm.memset' intrinsic with several
>>> store
>>> +/// operations and value calculation code. This is usually profitable
>>> for small
>>> +/// memory size.
>>>  static SDValue getMemsetStores(SelectionDAG &DAG, SDLoc dl,
>>>                                 SDValue Chain, SDValue Dst,
>>>                                 SDValue Src, uint64_t Size,
>>>
>>>
>>> The code below is a little confusing.  Can you please add braces to the
>>> else clause, or at least separate the next statement with a line break ?
>>>
>>>
>>> +  } else
>>> +    LongestValueInReg = MemSetValue;
>>> +  SDValue  RegInitialization = Chain;
>>> +
>>>
>>>
>>> Also here. Line break after the first "Value = “ line. I would also add
>>> braces around the last “Value = “ line.
>>>
>>> +          TLI.isTruncateFree(LargestVT, VT)) {
>>> +        if (ValueReg != 0)
>>> +          Value = DAG.getCopyFromReg(RegInitialization, dl, ValueReg,
>>> LargestVT);
>>> +        Value = DAG.getNode(ISD::TRUNCATE, dl, VT, Value);
>>> +      }
>>>        else
>>>          Value = getMemsetValue(Src, VT, DAG, dl);
>>>      }
>>>
>>>
>>> Thanks,
>>> Nadav
>>>
>>>
>>> On Sep 16, 2013, at 2:55 PM, Chandler Carruth <chandlerc at google.com>
>>> wrote:
>>>
>>> I wonder whether this is the best approach.
>>>
>>> Specifically, this is trading off register pressure for code size. I
>>> think it is probably the correct call in many cases, but perhaps not in all
>>> cases. Also, it has nothing to do with memset, and everything to do with
>>> any sequence of immediate operand movs.
>>>
>>> I think this would be done somewhat better as an MI peephole that takes
>>> into account register pressure and how many mov instructions share the
>>> immediate to tradeoff the register pressure and potential instructions to
>>> materialize the constant into a register against the shrink of immediate
>>> operand movs.
>>>
>>> But yes, Nadav or Andy would have more insights here...
>>>
>>>
>>> On Mon, Sep 16, 2013 at 10:00 AM, Serge Pavlov <sepavloff at gmail.com>wrote:
>>>
>>>> Friendly ping.
>>>>
>>>>
>>>> 2013/9/10 Serge Pavlov <sepavloff at gmail.com>
>>>>
>>>>>   Cosmetical update to the patch
>>>>>   If block size is small, the size of largest move chunk may be
>>>>> smaller than the
>>>>>   natural register width and target may not have resisters of such
>>>>> width. Example
>>>>>   is 32-bit target that does not have 16-bit registers. Avoid
>>>>> allocation such
>>>>>   illegal registers.
>>>>>
>>>>> Hi bkramer, nadav,
>>>>>
>>>>> http://llvm-reviews.chandlerc.com/D1484
>>>>>
>>>>> CHANGE SINCE LAST DIFF
>>>>>   http://llvm-reviews.chandlerc.com/D1484?vs=3692&id=4161#toc
>>>>>
>>>>> Files:
>>>>>   include/llvm/Target/TargetLowering.h
>>>>>   lib/CodeGen/SelectionDAG/SelectionDAG.cpp
>>>>>   lib/Target/X86/X86ISelLowering.cpp
>>>>>   lib/Target/X86/X86ISelLowering.h
>>>>>   test/CodeGen/X86/memset-sse-stack-realignment.ll
>>>>>   test/CodeGen/X86/memset.ll
>>>>>   test/CodeGen/X86/tlv-1.ll
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Thanks,
>>>> --Serge
>>>>
>>>> _______________________________________________
>>>> llvm-commits mailing list
>>>> llvm-commits at cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>>
>>>>
>>>
>>> _______________________________________________
>>> llvm-commits mailing list
>>> llvm-commits at cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>>
>>>
>>
>> _______________________________________________
>> llvm-commits mailing list
>> llvm-commits at cs.uiuc.edu
>> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>>
>>
>
>
> --
> Thanks,
> --Serge
>
> _______________________________________________
> llvm-commits mailing list
> llvm-commits at cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvm-commits
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.llvm.org/pipermail/llvm-commits/attachments/20130917/dfaa0d5a/attachment.html>