[LLVMdev] Memset/memcpy: user control of loop-idiom recognizer

Hal Finkel hfinkel at anl.gov
Sat Dec 6 05:06:31 PST 2014


----- Original Message -----
> From: "Kevin B Smith" <kevin.b.smith at intel.com>
> To: "Hal Finkel" <hfinkel at anl.gov>
> Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "Philip Reames" <listmail at philipreames.com>, "David
> Chisnall" <david.chisnall at cl.cam.ac.uk>, "Robert Lougher" <rob.lougher at gmail.com>
> Sent: Friday, December 5, 2014 10:05:49 PM
> Subject: RE: [LLVMdev] Memset/memcpy: user control of loop-idiom recognizer
> 
> Hal,
> 
> I appreciate the clarification.  That was what I was expecting (that
> the transformation uses intrinsics); the Intel compiler does the same
> thing internally, and like LLVM it transforms the loop into an
> internal intrinsic, not a plain library call.  Nevertheless, there are
> a huge number of ways (in machine code) to write "the best" memory
> copy or memory set if, as a programmer, you are able to constrain the
> parameters in many of the ways I was referring to.  And often, the
> loops that implement these operations have those constraints
> programmed into them, but with no real way to indicate that to the
> compilation system.  That sometimes makes it very tricky (as Rob is
> bringing up) for the lowering of these intrinsics to do as good a job
> as the original loop did.  Now as a counterpoint, of course there are
> also a bunch of cases where the compiler will do MUCH better than the
> original loop as well, and that is why both the LLVM and Intel
> compilation systems have made the effort to do this transformation.
> 
> I'm just trying to point out that the transformation from loop to
> intrinsic is lossy in a number of ways; that even if it weren't lossy,
> the number of possible lowerings results in a huge search space for
> the best lowering; and that, therefore, I think it is definitely worth
> considering what a reasonable way might be to throttle the
> loop->intrinsic transformation based on some IR-level hint coming from
> the programmer and through the front-end.

Hi Kevin,

I don't disagree, but if we can come up with a reasonable way of describing this space, then using that description to hint the memcpy intrinsic might be better than a binary recognize/don't-recognize switch; which approach is preferable is not yet clear to me. Off the top of my head, I can think of a few parameters we'd want to capture (the first is illustrated with a small sketch after the list):
 - Alignment (we currently provide one alignment, but the source and destination can have different alignments)
 - Direction (should the memory be traversed forward or backward)
 - Blocking factor and direction (how much memory should be loaded/stored "at a time", and in what order should those loads/stores be issued)
 - Load/store size (what data type was used for the individual loads/stores)
 - Cache hinting (if we do idiom recognition on target-specific intrinsics, we'd need to capture whether the stores were non-temporal, etc.)
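
As an illustration of the first point, here is a minimal C sketch (the
function name and the 16-vs-4-byte split are made up): the programmer
knows the source is 16-byte aligned but the destination only 4-byte
aligned, and a single alignment operand on the llvm.memcpy intrinsic
can only record the weaker of the two guarantees.

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  /* Hypothetical example: src is known to be 16-byte aligned, dst only
   * 4-byte aligned.  With a single alignment operand on the memcpy
   * intrinsic, only the smaller (4-byte) guarantee survives, and the
   * stronger source-alignment fact is lost to the lowering. */
  void copy_block(uint32_t *dst, const void *src_raw, size_t n_bytes) {
      const void *src = __builtin_assume_aligned(src_raw, 16);
      memcpy(dst, src, n_bytes);
  }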

Thanks again,
Hal

> 
> Kevin
> 
> -----Original Message-----
> From: Hal Finkel [mailto:hfinkel at anl.gov]
> Sent: Friday, December 05, 2014 5:45 PM
> To: Smith, Kevin B
> Cc: LLVM Developers Mailing List; Philip Reames; David Chisnall;
> Robert Lougher
> Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom
> recognizer
> 
> ----- Original Message -----
> > From: "Kevin B Smith" <kevin.b.smith at intel.com>
> > To: "Philip Reames" <listmail at philipreames.com>, "David Chisnall"
> > <david.chisnall at cl.cam.ac.uk>, "Robert Lougher"
> > <rob.lougher at gmail.com>
> > Cc: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>
> > Sent: Friday, December 5, 2014 1:06:14 PM
> > Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom
> > recognizer
> > 
> > There are a large number of ways to lose information when
> > translating loops into memset/memcpy calls; alignment is one of
> > them.  As previously mentioned, loop trip count is another.  Another
> > is the size of the accesses.  For example, the loop may have
> > originally been using int64_t-sized copies.  This has a definite
> > impact on what the best memset/memcpy expansion is, because
> > effectively the loop knows that it is always writing a multiple of 8
> > bytes, and does so in 8-byte chunks.  So the fact that the number of
> > bytes has a specific value property (like the lower 3 bits always
> > being 0, which is another reason for having known bits and known bit
> > values :-)) should affect the lowering of such loops/calls, but
> > probably doesn't.
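
For concreteness, a minimal sketch of the kind of loop being described
(the function name is illustrative): the loop structure itself
guarantees that the byte count is a multiple of 8 and that every access
is 8 bytes wide, facts that a flat byte count passed to memcpy no
longer expresses.

  #include <stddef.h>
  #include <stdint.h>

  /* Copies whole int64_t elements, so the total number of bytes moved
   * is 8 * n: the low 3 bits of the byte count are always zero, and
   * each load/store is 8 bytes wide (and 8-byte aligned, given aligned
   * inputs).  Summarized as memcpy(dst, src, 8 * n), those guarantees
   * are implicit at best. */
  void copy_words(int64_t *dst, const int64_t *src, size_t n) {
      for (size_t i = 0; i < n; ++i)
          dst[i] = src[i];
  }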
> 
> Hi Kevin,
> 
> Just so everyone is on the same page, when we convert a loop to a
> memcpy intrinsic, we're really talking about this:
> http://llvm.org/docs/LangRef.html#llvm-memcpy-intrinsic -- and this
> intrinsic carries alignment information. Now one problem is that it
> carries only one alignment specifier, not separate ones for the
> source and destination, and we may want to improve that.
> Nevertheless, I want everyone to understand that we're not just
> transforming these loops into libc calls, but into intrinsics, and
> the targets then control whether these are expanded, and how, or
> turned into actual libc calls.
> 
> > 
> > Database folks often write their own copy routines for use in
> > specific instances, as do OSes, such as when they know they are
> > clearing or copying exactly page-sized regions on page-size
> > boundaries.  They have very special implementations of these,
> > including some that use non-temporal hints so as not to pollute the
> > cache.
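
A rough sketch of that style of routine, assuming x86 with SSE2 (the
function name is illustrative, and the caller is assumed to pass a
16-byte-aligned region whose size is a multiple of 16, such as a page):
fixed 16-byte blocks written with non-temporal stores so the cleared
memory does not displace useful cache lines.

  #include <emmintrin.h>   /* SSE2: _mm_setzero_si128, _mm_stream_si128 */
  #include <stddef.h>

  /* Clear a page-aligned, page-sized buffer with streaming
   * (non-temporal) stores that bypass the cache. */
  void clear_page_nt(void *page, size_t bytes) {
      __m128i zero = _mm_setzero_si128();
      char *p = (char *)page;
      for (size_t i = 0; i < bytes; i += 16)
          _mm_stream_si128((__m128i *)(p + i), zero);
      _mm_sfence();   /* order the streaming stores before returning */
  }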
> 
> I don't think we perform loop idiom recognition based on
> target-specific intrinsics (such as those providing non-temporal
> stores).
> 
>  -Hal
> 
> > 
> > It is also worth pointing out that most such loops have a very
> > specific, well-defined behavior when the source and destination
> > overlap, and that memcpy does not.
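
For example (a minimal sketch; the function is illustrative): a forward
element-by-element copy of overlapping regions is well-defined, whereas
memcpy on overlapping regions is undefined behavior, so the loop and
the library call are not interchangeable in general.

  #include <stddef.h>

  /* Shift an array left by one element.  The source and destination
   * ranges overlap, but the forward loop reads each a[i + 1] before any
   * later iteration overwrites it, so the result is well-defined.
   * Rewriting it as memcpy(a, a + 1, (n - 1) * sizeof *a) (for n >= 1)
   * would invoke undefined behavior, because memcpy's operands must not
   * overlap. */
  void shift_left(int *a, size_t n) {
      for (size_t i = 0; i + 1 < n; ++i)
          a[i] = a[i + 1];
  }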
> > 
> > There are definitely good reasons why various knowledgeable users
> > would not want a compiler to perform such a transform on at least
> > some of their loops.
> > 
> > Kevin Smith
> > 
> > -----Original Message-----
> > From: llvmdev-bounces at cs.uiuc.edu
> > [mailto:llvmdev-bounces at cs.uiuc.edu] On Behalf Of Philip Reames
> > Sent: Friday, December 05, 2014 10:08 AM
> > To: David Chisnall; Robert Lougher
> > Cc: LLVM Developers Mailing List
> > Subject: Re: [LLVMdev] Memset/memcpy: user control of loop-idiom
> > recognizer
> > 
> > 
> > On 12/04/2014 11:46 PM, David Chisnall wrote:
> > > On 3 Dec 2014, at 23:36, Robert Lougher <rob.lougher at gmail.com>
> > > wrote:
> > >
> > >> On 2 December 2014 at 22:18, Alex Rosenberg
> > >> <alexr at leftfield.org>
> > >> wrote:
> > >>> Our C library amplifies this problem by being in a dynamic
> > >>> library, so the
> > >>> call has additional overhead, which for small trip counts
> > >>> swamps
> > >>> the
> > >>> copy/set.
> > >>>
> > >> I can't imagine we're the only platform (now or in the future)
> > >> that has comparatively slow library calls.  We had discussed
> > >> some sort of platform flag (has slow library calls) but this
> > >> would be too late to affect the loop-idiom pass.  However, it
> > >> could affect lowering.  Following on from Reid's earlier idea to
> > >> lower short memcpys to an inlined, slightly widened loop, we
> > >> could expand into a guarded loop for small values and a call?
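
A C-level sketch of that guarded-loop idea (the 32-byte threshold and
the function name are arbitrary, purely for illustration): small copies
take an inline loop, everything else falls back to the call.

  #include <stddef.h>
  #include <string.h>

  /* Guarded lowering sketch: copy short buffers with a simple inline
   * loop and call memcpy only when the size makes the call overhead
   * worthwhile.  The real transform would happen during lowering, not
   * in source code; this just shows the shape of the emitted code. */
  void copy_guarded(char *dst, const char *src, size_t n) {
      if (n <= 32) {
          for (size_t i = 0; i < n; ++i)
              dst[i] = src[i];
      } else {
          memcpy(dst, src, n);
      }
  }
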
> > > I think the bug is not that we are recognising that the loop is a
> > > memcpy, it's that we're then generating an inefficient memcpy.  We
> > > do this for a variety of reasons, some of which apply elsewhere.
> > > One issue I hit a few months ago was that the vectoriser doesn't
> > > notice whether unaligned loads and stores are supported, so it will
> > > happily replace two adjacent i32 align 4 loads followed by two
> > > adjacent i32 align 4 stores with an i64 align 4 load followed by
> > > an i64 align 4 store, which more than doubles the number of
> > > instructions that the back end emits.
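
A small C sketch of the pattern being described (names are
illustrative): two adjacent 4-byte-aligned accesses that an optimizer
might merge into a single 8-byte access carrying only 4-byte alignment,
which a target without fast unaligned 64-bit loads and stores then has
to break apart again.

  #include <stdint.h>

  /* Two adjacent i32, align-4 loads and two adjacent i32, align-4
   * stores.  Merging each pair into one i64 access leaves that access
   * with only 4-byte alignment; on a target where unaligned 64-bit
   * accesses are unsupported, the back end must expand it, producing
   * more instructions than the original pair of 32-bit accesses. */
  void copy_pair(uint32_t *dst, const uint32_t *src) {
      uint32_t a = src[0];
      uint32_t b = src[1];
      dst[0] = a;
      dst[1] = b;
  }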
> > >
> > > We expand memcpy and friends in several different places (in the
> > > IR
> > > in at least one place, then in SelectionDAG, and then again in
> > > the
> > > back end, as I recall - I remember playing whack-a-bug with this
> > > for a while as the lowering was differently broken for our target
> > > in each place).  In SelectionDAG, we're dealing with a single
> > > basic block, so we can't construct the loop.  In the back end
> > > we've already lost a lot of high-level type information that
> > > would
> > > make this easier.
> > >
> > > I'd be in favour of consolidating the memcpy / memset / memmove
> > > expansion into an IR pass that would take a cost model from the
> > > target.
> > +1
> > 
> > It sounds like we might also be losing information about alignment
> > in the loop-idiom recognizer.  Or at least not using it when we
> > lower.
> > >
> > > David
> > >
> > >
> > 
> 
> --
> Hal Finkel
> Assistant Computational Scientist
> Leadership Computing Facility
> Argonne National Laboratory
> 

-- 
Hal Finkel
Assistant Computational Scientist
Leadership Computing Facility
Argonne National Laboratory


