[LLVMdev] RFC: Codifying (but not formalizing) the optimization levels in LLVM and Clang
Hal Finkel
hfinkel at anl.gov
Mon Jan 14 10:12:57 PST 2013
----- Original Message -----
> From: "Chandler Carruth" <chandlerc at gmail.com>
> To: "LLVM Developers Mailing List" <llvmdev at cs.uiuc.edu>, "clang-dev Developers" <cfe-dev at cs.uiuc.edu>
> Sent: Monday, January 14, 2013 3:09:01 AM
> Subject: [LLVMdev] RFC: Codifying (but not formalizing) the optimization levels in LLVM and Clang
>
> This has been an idea floating around in my head for a while and
> after several discussions with others it continues to hold up so I
> thought I would mail it out. Sorry for cross posting to both lists,
> but this is an issue that would significantly impact both LLVM and
> Clang.
>
>
> Essentially, LLVM provides canned optimization "levels" for frontends
> to re-use. This is nothing new. However, we don't have good names
> for them, we don't expose them at the IR level, and we have a hard
> time figuring out which optimizations belong in which levels. I'd
> like to try addressing that by coming up with names and a
> description of the basic intended goal of each level. I would like,
> if folks are happy with these ideas, to add these descriptions
> alongside the attributes in the LangRef. Ideas on other (better?)
> places to document this would be welcome. Certainly, Clang's
> documentation would need to be updated to reflect this.
>
>
> Hopefully we can minimally debate this until the bikeshed is a
> tolerable shade. Note that I'm absolutely biased based on the
> behavior of Clang and GCC with these optimization levels, and the
> relevant history there. However, I'm adding to, and deviating from,
> the purely historical differences to try to better model the latest
> developments in LLVM's optimizer... So here goes:
>
> 1) The easiest: 'Minimize Size' or '-Oz'
> - Attribute: minsize (we already have it, nothing to do here)
>
> - Goal: minimize the size of the resulting binary, at (nearly) any
> cost.
>
> 2) Optimize for size or '-Os'
> - Attribute: optsize (we already have it, nothing to do here)
> - Goal: Optimize the execution of the binary without unreasonably[1]
> increasing the binary size.
> This one is a bit fuzzy, but usually people don't have a hard time
> figuring out where the line is. The primary difference between
> minsize and optsize is that with minsize a pass is free to *hurt*
> performance to shrink the size.
>
>
> [1] The definition of 'unreasonable' is of course subjective, but
> here is at least one strong indicator: any code size growth which is
> inherently *speculative* (that is, there isn't a known, demonstrable
> performance benefit, but rather it is "often" or "maybe" a benefit)
> is unlikely to be a good fit in optsize. The canonical example IMO
> is a vectorizer -- while it is reasonable to vectorize a loop, if
> the vector version might not be executed, and thus the scalar loop
> remains as well, then it is a poor fit for optsize.
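>
> As a concrete illustration, here is roughly what this looks like in
> the IR today -- minsize and optsize already exist as function
> attributes, and the function names and bodies below are just
> placeholders:
>
>   ; -Oz: shrink at (nearly) any cost, even if it hurts performance
>   define i32 @decode(i32 %x) minsize {
>   entry:
>     %r = add i32 %x, 1
>     ret i32 %r
>   }
>
>   ; -Os: optimize, but avoid speculative code-size growth
>   define i32 @parse(i32 %x) optsize {
>   entry:
>     %r = mul i32 %x, 3
>     ret i32 %r
>   }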
>
> 3) Optimize quickly or '-O1'
> - Attribute: quickopt (this would be a new attribute)
> - Goal: Perform basic optimizations to improve both performance and
> simplicity of the code, but perform them *quickly*.
> This level is all about compile time, but in a holistic sense. It
> tries to perform basic optimizations to get reasonably efficient
> code, and get it very quickly.
>
> 4) Good, well-balanced optimizations, or '-O2'
> - Attribute: opt (new attribute)
> - Goal: produce a well-optimized binary trading off compile time,
> space, and runtime efficiency.
> This should be an excellent default for general purpose programs. The
> idea is to do as much optimization as we can, in as reasonable of a
> time frame, and with as reasonable code size impact as possible.
> This level should always produce binaries at least as fast as
> optsize, but they might be both bigger and faster. This level should
> always produce binaries at least as fast as quickopt, but they might
> take longer to compile.
>
> 5) Optimize to the max or '-O3'
> - Attribute: maxopt (new attribute)
> - Goal: produce the fastest binary possible.
> This level has historically been almost exclusively about trading off
> more binary size for speed than '-O2', but I would propose we change
> it to be more about trading off either binary size or compilation
> time to achieve a better performing binary. This level should always
> produce binaries at least as fast as opt, but they might be faster
> at the cost of them being larger and taking more time to compile.
> This would in some cases be a change for LLVM and is definitely a
> deviation from GCC, where -O3 will in many cases produce *slower*
> binaries due to code size increases that are not accompanied by
> corresponding performance increases.
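>
> To make the proposal concrete, here is a rough sketch of what
> per-function level attributes might look like in the IR. Note that
> 'quickopt', 'opt', and 'maxopt' do not exist today; they are shown
> here as string attributes only to sketch the idea (whether they end
> up as keywords or strings is a detail for the actual patch), and the
> functions are placeholders:
>
>   define void @startup() "quickopt" {    ; built at -O1
>   entry:
>     ret void
>   }
>
>   define void @steady_state() "opt" {    ; built at -O2
>   entry:
>     ret void
>   }
>
>   define void @hot_kernel() "maxopt" {   ; built at -O3
>   entry:
>     ret void
>   }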
>
> To go with these LLVM attributes I'd like to add attribute support in
> Clang, both GCC-compatible spellings and ones using the names above
> for clarity. The goal is to allow a specific function to have its
> optimization level overridden from the command-line-based level.
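>
> For example (again just a sketch using the proposed attribute names,
> which don't exist yet, and placeholder functions): a file built with
> -O2 where one function has been marked in its source to be optimized
> to the max might lower to IR along these lines:
>
>   define void @helper() "opt" {          ; from the -O2 command line
>   entry:
>     ret void
>   }
>
>   define void @inner_loop() "maxopt" {   ; overridden by the attribute
>   entry:
>     ret void
>   }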
>
> A final note: I would like to remove all other variations on the '-O'
> flag. That includes the really strange '-O4' behavior. Whether the
> compilation uses LTO should be orthogonal to the particular level of
> optimization, and we have -flto to achieve this.
I agree that -O4 and LTO should be separated. In general, we need an easy way for passes (like the vectorizer, loop unroller, etc.) to know not only the current optimization level (which should be easy if these are all function attributes), but also whether they are executing before or during the "final" optimization phase, because there are some things you don't want to do if you'll later be using LTO.
Nevertheless, I think that we definitely do want optimization levels above -O3. None of these should make code slower; rather, they would provide a rough indication of how long the user is willing to wait for the compilation process to complete. For example, currently when -vectorize is given, not only is basic-block vectorization enabled, but so are extra invocations of EarlyCSE and InstCombine. These latter two passes run even if the BBVectorizer does nothing (which indicates another needed infrastructure improvement, but that's another story), and they sometimes produce a small speedup independent of the vectorizer. Because of the compile-time impact, there might not be sufficient justification for enabling these passes by default at -O3, but it would make sense to do so at a -O4 level: the user wants the compiler to 'try harder'.
There are several parts of the compiler that are essentially solving combinatorial optimization problems. Basic-block vectorization is an obvious example, but instruction scheduling can also fall into this category, as can some forms of reassociation, inlining, etc. We currently use heuristics and cutoffs to deal with this, but given the asymptotic scaling of each algorithm, we could set these cutoffs, and choose the approximation algorithm to use, based on the current optimization level. I'd recommend something like this: -O4 can take twice as long as -O3, -O5 can take twice as long as -O4, etc.
-Hal
>
>
> -Chandler
> _______________________________________________
> LLVM Developers mailing list
> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
--
Hal Finkel
Postdoctoral Appointee
Leadership Computing Facility
Argonne National Laboratory