[cfe-commits] [PATCH] Initial version of formatting library

Thu Nov 15 12:11:15 PST 2012

  I think what this needs is a bit more principled approach to the task of reformatting. I propose the following model, based roughly on what it looks like you are implementing in this patch:

  The task is broken up into two main parts governed by orthogonal policies (and implemented in their own, appropriately named file(s)).

  #  Identify "logical lines" (what you call "continuations"). The way to think about this is that if theoretically there were no line length limit, then how would the source be laid out. The tokens which end up being on each "logical line" are handled separately. Your comment about "the key property" in the comment above Continuation is truly pure gold: basically, what this says is that identifying logical lines is orthogonal to...
  # Wrapping logical lines to the desired physical source width. This has an independent set of policy decisions associated with it and deals with a completely different set of abstractions. Really the only input from from the last step is how deep this logical line is indented (and hence how much to subtract from the available physical line length).

  Basically, the task and corresponding configuration policy associated with each part can be cleanly separated. The configurability at each step is:

  # How to identify logical lines.
    * There can be a predetermined set of locations at which we call "ship out the next logical line of accumulated tokens", and during the "parsing", at key places we do like `if (LineBreakConf.shouldBreakBeforeIfBrace()) shipOutNextLogicalLine();`.
    This is an easily extensible mechanism and provides a clear *model* for extensibility.
    * if we really want to get crazy, we can offer a callback here and let the user decide on a case-by case basis.

  # How to wrap logical lines to a physical source width and place tokens.
    * This is a bit more complicated. Once again, we can provide a set of predetermined places to break (e.g. "after `&&`") a set of predetermined policies for how to align wrapped lines, and how to place tokens ("put the * on the right"), along with relative weights for each one. These can all be boiled down into a set of weights parametrizing where to break, how to align things, and where to put spaces (similar to TeX's notion of "badness"). Actually, this is basically the same task as what TeX does when breaking lines to put together paragraphs, so rather than rolling your own line breaking algorithm, you might consider just using the TeX algorithm (see Knuth & Plass, "Breaking Paragraphs into Lines", Software Practice and Experience, 1981).

  There is probably a use case for a "peephole" pass at the end, to deal with things like "no braces if the `if` only has a single statement for its body".

  Having a model like this is a HUGE boon for understandability, usability, and maintenance of the library.

================
Comment at: lib/Format/ContinuationParser.h:9-12
@@ +8,6 @@
+//===----------------------------------------------------------------------===//
+//
+//  This is EXPERIMENTAL code under heavy development. It is not in a state yet,
+//  where it can be used to format real code.
+//
+//===----------------------------------------------------------------------===//
----------------
http://llvm.org/docs/CodingStandards.html#file-headers

Describe the file. Describe the "theory of operations".

================
Comment at: lib/Format/ContinuationParser.h:49-55
@@ +48,9 @@
+
+/// \brief A continuation is a sequence of \c Token, that we would like to put
+/// on a single line if there was no column limit.
+///
+/// This is used as a main interface between the \c ContinuationParser and the
+/// \c ContinuationFormatter. The key property is that changing the formatting
+/// within a continuation does not affect any other continuation.
+struct Continuation {
+  Continuation() : Level(0) {}
----------------
Why is this called "continuation"? The name doesn't seem to have anything to do with the description; the only meaning of "continuation" as a noun that I can think of is "line continuation", which is the separating characters that join the logical line across physical source lines (e.g. http://msdn.microsoft.com/en-us/library/aa711641(v=vs.71).aspx).

How about "LogicalLine"?

Your sentence "the key property" is pure gold, btw.

================
Comment at: lib/Format/ContinuationParser.h:58-59
@@ +57,4 @@
+
+  /// \brief The \c Token comprising this continuation.
+  std::vector<FormatToken> Tokens;
+
----------------
Prefer SmallVector

================
Comment at: lib/Format/ContinuationParser.cpp:9-12
@@ +8,6 @@
+//===----------------------------------------------------------------------===//
+//
+//  This is EXPERIMENTAL code under heavy development. It is not in a state yet,
+//  where it can be used to format real code.
+//
+//===----------------------------------------------------------------------===//
----------------
Proper file header needed here too.

================
Comment at: lib/Format/Format.cpp:45
@@ +44,3 @@
+    State.Indent.push_back(Cont.Level * 2 + 4);
+    for (unsigned i = 1, e = Cont.Tokens.size(); i != e; ++i) {
+      bool InsertNewLine = Cont.Tokens[i].NewlinesBefore > 0;
----------------
http://llvm.org/docs/CodingStandards.html#name-types-functions-variables-and-enumerators-properly

I and N are the canonical names for integer loop variables (I and E moreso for iterators).

also, comment why you are iterating starting at 1.

================
Comment at: lib/Format/Format.cpp:58
@@ +57,3 @@
+private:
+  /// \brief The current state when indenting a continuation.
+  ///
----------------
Ugh. Yeah, "logical line" would be a lot better.

================
Comment at: lib/Format/Format.cpp:86
@@ +85,3 @@
+
+  // Append the token at 'Index' to the IndentState 'State'.
+  void addToken(unsigned Index, bool Newline, bool DryRun, IndentState &State) {
----------------
Doxygenify

http://llvm-reviews.chandlerc.com/D80