[llvm] eb5af0a - [Symbolize] Add log markup --filter to llvm-symbolizer.

Daniel Thornburgh via llvm-commits llvm-commits at lists.llvm.org
Mon Jun 27 10:44:22 PDT 2022


Author: Daniel Thornburgh
Date: 2022-06-27T10:44:15-07:00
New Revision: eb5af0acf054a73d461ac768d8cb035ee2a64383

URL: https://github.com/llvm/llvm-project/commit/eb5af0acf054a73d461ac768d8cb035ee2a64383
DIFF: https://github.com/llvm/llvm-project/commit/eb5af0acf054a73d461ac768d8cb035ee2a64383.diff

LOG: [Symbolize] Add log markup --filter to llvm-symbolizer.

This adds a --filter option to llvm-symbolizer. This takes log-bearing
symbolizer markup from stdin and writes a human-readable version to
stdout.

For now, this only implements the "symbol" markup tag; all others are
passed through unaltered. This is a proof-of-concept bit of
functionality; implementing the remaining tags is more or less just a
matter of hooking up various parts of the Symbolize library to the
architecture established here.
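
As a rough illustration of the behavior described above (not the code in
this commit), the sketch below rewrites {{{symbol:...}}} elements in a line
and copies all other text through unchanged. The helper names
demangleOrKeep and filterSymbolElements are hypothetical, and
abi::__cxa_demangle from <cxxabi.h> stands in for llvm::demangle, so it
assumes a GCC- or Clang-style C++ runtime. Unlike the real filter, it does
not validate the field count or track SGR color state.

```cpp
#include <cxxabi.h>

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: demangle Itanium-ABI C++ names (those starting with
// "_Z"); return every other name unchanged.
std::string demangleOrKeep(const std::string &Name) {
  if (Name.rfind("_Z", 0) != 0)
    return Name;
  int Status = 0;
  char *Out = abi::__cxa_demangle(Name.c_str(), nullptr, nullptr, &Status);
  if (Status != 0 || Out == nullptr)
    return Name;
  std::string Result(Out);
  std::free(Out); // __cxa_demangle allocates the result with malloc.
  return Result;
}

// Hypothetical helper: replace each {{{symbol:NAME}}} element in Line with
// the demangled NAME. Other elements and plain text are copied verbatim,
// as is an element that is never closed.
std::string filterSymbolElements(const std::string &Line) {
  static const std::string Open = "{{{symbol:";
  static const std::string Close = "}}}";
  std::string Out;
  std::size_t Pos = 0;
  while (true) {
    std::size_t Begin = Line.find(Open, Pos);
    if (Begin == std::string::npos)
      break;
    std::size_t NameBegin = Begin + Open.size();
    std::size_t End = Line.find(Close, NameBegin);
    if (End == std::string::npos)
      break;
    Out.append(Line, Pos, Begin - Pos); // Text before the element.
    Out += demangleOrKeep(Line.substr(NameBegin, End - NameBegin));
    Pos = End + Close.size();
  }
  Out.append(Line, Pos, std::string::npos); // Remaining text.
  return Out;
}
```

Under these assumptions, passing "fault in {{{symbol:_ZN7Mangled4NameEv}}}"
to filterSymbolElements should yield "fault in Mangled::Name()", while a
line containing only {{{pc:0x12345678}}} comes back unchanged.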

Reviewed By: peter.smith

Differential Revision: https://reviews.llvm.org/D126980

Added: 
    llvm/docs/SymbolizerMarkupFormat.rst
    llvm/include/llvm/DebugInfo/Symbolize/MarkupFilter.h
    llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp
    llvm/test/DebugInfo/symbolize-filter-markup-color.test
    llvm/test/DebugInfo/symbolize-filter-markup-error-location.test
    llvm/test/DebugInfo/symbolize-filter-markup-symbol.test
    llvm/test/DebugInfo/symbolize-filter-markup-tag.test
    llvm/test/tools/llvm-symbolizer/filter-markup.test

Modified: 
    llvm/docs/CommandGuide/llvm-symbolizer.rst
    llvm/docs/Reference.rst
    llvm/docs/ReleaseNotes.rst
    llvm/include/llvm/DebugInfo/Symbolize/Markup.h
    llvm/lib/DebugInfo/Symbolize/CMakeLists.txt
    llvm/tools/llvm-symbolizer/Opts.td
    llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp

Removed: 
    


################################################################################
diff --git a/llvm/docs/CommandGuide/llvm-symbolizer.rst b/llvm/docs/CommandGuide/llvm-symbolizer.rst
index dc8d72ae97625..22ed6d9de00a8 100644
--- a/llvm/docs/CommandGuide/llvm-symbolizer.rst
+++ b/llvm/docs/CommandGuide/llvm-symbolizer.rst
@@ -12,7 +12,9 @@ DESCRIPTION
 -----------
 
 :program:`llvm-symbolizer` reads input names and addresses from the command-line
-and prints corresponding source code locations to standard output.
+and prints corresponding source code locations to standard output. It can also
+symbolize logs containing :doc:`Symbolizer Markup </SymbolizerMarkupFormat>` via
+:option:`--filter-markup`.
 
 If no address is specified on the command-line, it reads the addresses from
 standard input. If no input name is specified on the command-line, but addresses
@@ -213,6 +215,12 @@ OPTIONS
   Look up the object using the given build ID, specified as a hexadecimal
   string. Mutually exclusive with :option:`--obj`.
 
+.. option:: --color [=<always|auto|never>]
+
+  Specify whether to use color in :option:`--filter-markup` mode. Defaults to
+  ``auto``, which detects whether standard output supports color. Specifying
+  ``--color`` alone is equivalent to ``--color=always``.
+
 .. option:: --debuginfod, --no-debuginfod
 
   Whether or not to try debuginfod lookups for debug binaries. Unless specified,
@@ -239,6 +247,15 @@ OPTIONS
   link section, use the specified path as a basis for locating the debug data if
   it cannot be found relative to the object.
 
+.. option:: --filter-markup
+
+  Reads from standard input, converts contained
+  :doc:`Symbolizer Markup </SymbolizerMarkupFormat>` into human-readable form,
+  and prints the results to standard output. Presently, only the following
+  markup elements are supported:
+
+  * ``{{symbol}}``
+
 .. _llvm-symbolizer-opt-f:
 
 .. option:: --functions [=<none|short|linkage>], -f

diff --git a/llvm/docs/Reference.rst b/llvm/docs/Reference.rst
index 73e9b03814106..f6b80eb96caf4 100644
--- a/llvm/docs/Reference.rst
+++ b/llvm/docs/Reference.rst
@@ -43,6 +43,7 @@ LLVM and API reference documentation.
    StackMaps
    SpeculativeLoadHardening
    Statepoints
+   SymbolizerMarkupFormat
    SystemLibrary
    TestingGuide
    TransformMetadata
@@ -79,6 +80,9 @@ Command Line Utilities
 :doc:`OptBisect`
   A command line option for debugging optimization-induced failures.
 
+:doc:`SymbolizerMarkupFormat`
+  A reference for the log symbolizer markup accepted by ``llvm-symbolizer``.
+
 :doc:`The Microsoft PDB File Format <PDB/index>`
   A detailed description of the Microsoft PDB (Program Database) file format.
 

diff --git a/llvm/docs/ReleaseNotes.rst b/llvm/docs/ReleaseNotes.rst
index 1e9dc99b90742..4391d55b4e90d 100644
--- a/llvm/docs/ReleaseNotes.rst
+++ b/llvm/docs/ReleaseNotes.rst
@@ -192,6 +192,10 @@ During this release ...
 Changes to the LLVM tools
 ---------------------------------
 
+* (Experimental) :manpage:`llvm-symbolizer(1)` now has ``--filter-markup`` to
+  filter :doc:`Symbolizer Markup </SymbolizerMarkupFormat>` into human-readable
+  form.
+
 Changes to LLDB
 ---------------------------------
 

diff --git a/llvm/docs/SymbolizerMarkupFormat.rst b/llvm/docs/SymbolizerMarkupFormat.rst
new file mode 100644
index 0000000000000..dfd9d6b5b7706
--- /dev/null
+++ b/llvm/docs/SymbolizerMarkupFormat.rst
@@ -0,0 +1,434 @@
+==========================
+Symbolizer Markup Format
+==========================
+
+.. contents::
+   :local:
+
+Overview
+========
+
+This document defines a text format for log messages that can be processed by a
+symbolizing filter. The basic idea is that logging code emits text that contains
+raw address values and so forth, without the logging code doing any real work to
+convert those values to human-readable form. Instead, logging text uses the
+markup format defined here to identify pieces of information that should be
+converted to human-readable form after the fact. As with other markup formats,
+the expectation is that most of the text will be displayed as is, while the
+markup elements will be replaced with expanded text, or converted into active UI
+elements, that present more details in symbolic form.
+
+This means there is no need for symbol tables, DWARF debugging sections, or
+similar information to be directly accessible at runtime. There is also no need
+at runtime for any logic intended to compute human-readable presentation of
+information, such as C++ symbol demangling. Instead, logging must include markup
+elements that give the contextual information necessary to make sense of the raw
+data, such as memory layout details.
+
+This format identifies markup elements with a syntax that is both simple and
+distinctive. It's simple enough to be matched and parsed with straightforward
+code. It's distinctive enough that character sequences that look like the start
+or end of a markup element should rarely if ever appear incidentally in logging
+text. It's specifically intended not to require sanitizing plain text, such as
+the HTML/XML requirement to replace ``<`` with ``&lt;`` and the like.
+
+:manpage:`llvm-symbolizer(1)` includes a symbolizing filter via its
+``--filter-markup`` option.
+
+Scope and assumptions
+=====================
+
+A symbolizing filter implementation will be independent both of the target
+operating system and machine architecture where the logs are generated and of
+the host operating system and machine architecture where the filter runs.
+
+This format assumes that the symbolizing filter processes intact whole lines. If
+long lines might be split during some stage of a logging pipeline, they must be
+reassembled to restore the original line breaks before feeding lines into the
+symbolizing filter. Most markup elements must appear entirely on a single line
+(often with other text before and/or after the markup element). There are some
+markup elements that are specified to span lines, with line breaks in the middle
+of the element. Even in those cases, the filter is not expected to handle line
+breaks in arbitrary places inside a markup element, but only inside certain
+fields.
+
+This format assumes that the symbolizing filter processes a coherent stream of
+log lines from a single process address space context. If a logging stream
+interleaves log lines from more than one process, these must be collated into
+separate per-process log streams and each stream processed by a separate
+instance of the symbolizing filter. Because the kernel and user processes use
+disjoint address regions in most operating systems, a single user process
+address space plus the kernel address space can be treated as a single address
+space for symbolization purposes if desired.
+
+Dependence on Build IDs
+=======================
+
+The symbolizer markup scheme relies on contextual information about runtime
+memory address layout to make it possible to convert markup elements into useful
+symbolic form. This relies on having an unmistakable identification of which
+binary was loaded at each address.
+
+An ELF Build ID is the payload of an ELF note with name ``"GNU"`` and type
+``NT_GNU_BUILD_ID``, a unique byte sequence that identifies a particular binary
+(executable, shared library, loadable module, or driver module). The linker
+generates this automatically based on a hash that includes the complete symbol
+table and debugging information, even if this is later stripped from the binary.
+
+This specification uses the ELF Build ID as the sole means of identifying
+binaries. Each binary relevant to the log must have been linked with a unique
+Build ID. The symbolizing filter must have some means of mapping a Build ID back
+to the original ELF binary (either the whole unstripped binary, or a stripped
+binary paired with a separate debug file).
+
+Colorization
+============
+
+The markup format supports a restricted subset of ANSI X3.64 SGR (Select Graphic
+Rendition) control sequences. These are unlike other markup elements:
+
+* They specify presentation details (bold or colors) rather than semantic
+  information. The association of semantic meaning with color (e.g. red for
+  errors) is chosen by the code doing the logging, rather than by the UI
+  presentation of the symbolizing filter. This is a concession to existing code
+  (e.g. LLVM sanitizer runtimes) that uses specific colors and would require
+  substantial changes to generate semantic markup instead.
+
+* A single control sequence changes "the state", rather than being a
+  hierarchical structure that surrounds affected text.
+
+The filter processes ANSI SGR control sequences only within a single line. If a
+control sequence to enter a bold or color state is encountered, it's expected
+that the control sequence to reset to default state will be encountered before
+the end of that line. If a "dangling" state is left at the end of a line, the
+filter may reset to default state for the next line.
+
+An SGR control sequence is not interpreted inside any other markup element.
+However, other markup elements may appear between SGR control sequences and the
+color/bold state is expected to apply to the symbolic output that replaces the
+markup element in the filter's output.
+
+The accepted SGR control sequences all have the form ``"\033[%um"`` (expressed here
+using C string syntax), where ``%u`` is one of these:
+
+==== ============================ ===============================================
+Code Effect                       Notes
+==== ============================ ===============================================
+0    Reset to default formatting.
+1    Bold text                    Combines with color states, doesn't reset them.
+30   Black foreground
+31   Red foreground
+32   Green foreground
+33   Yellow foreground
+34   Blue foreground
+35   Magenta foreground
+36   Cyan foreground
+37   White foreground
+==== ============================ ===============================================
+
+Common markup element syntax
+============================
+
+All the markup elements share a common syntactic structure to facilitate simple
+matching and parsing code. Each element has the form::
+
+  {{{tag:fields}}}
+
+``tag`` identifies one of the element types described below, and is always a
+short alphabetic string that must be in lower case. The rest of the element
+consists of one or more fields. Fields are separated by ``:`` and cannot contain
+any ``:`` or ``}`` characters. How many fields must be or may be present and
+what they contain is specified for each element type.
+
+No markup elements or ANSI SGR control sequences are interpreted inside the
+contents of a field.
+
+In the descriptions of each element type, ``printf``-style placeholders indicate
+field contents:
+
+``%s``
+  A string of printable characters, not including ``:`` or ``}``.
+
+``%p``
+  An address value represented by ``0x`` followed by an even number of
+  hexadecimal digits (using either lower-case or upper-case for ``A``–``F``).
+  If the digits are all ``0`` then the ``0x`` prefix may be omitted. No more
+  than 16 hexadecimal digits are expected to appear in a single value (64 bits).
+
+``%u``
+  A nonnegative decimal integer.
+
+``%i``
+  A nonnegative integer. The digits are hexadecimal if prefixed by ``0x``, octal
+  if prefixed by ``0``, or decimal otherwise.
+
+``%x``
+  A sequence of an even number of hexadecimal digits (using either lower-case or
+  upper-case for ``A``–``F``), with no ``0x`` prefix. This represents an
+  arbitrary sequence of bytes, such as an ELF Build ID.
+
+Presentation elements
+=====================
+
+These are elements that convey a specific program entity to be displayed in
+human-readable symbolic form.
+
+``{{{symbol:%s}}}``
+  Here ``%s`` is the linkage name for a symbol or type. It may require
+  demangling according to language ABI rules. Even for unmangled names, it's
+  recommended that this markup element be used to identify a symbol name so that
+  it can be presented distinctively.
+
+  Examples::
+
+    {{{symbol:_ZN7Mangled4NameEv}}}
+    {{{symbol:foobar}}}
+
+``{{{pc:%p}}}``, ``{{{pc:%p:ra}}}``, ``{{{pc:%p:pc}}}`` [#not_yet_implemented]_
+
+  Here ``%p`` is the memory address of a code location. It might be presented as a
+  function name and source location. The second two forms distinguish the kind of
+  code location, as described in detail for bt elements below.
+
+  Examples::
+
+    {{{pc:0x12345678}}}
+    {{{pc:0xffffffff9abcdef0}}}
+
+``{{{data:%p}}}`` [#not_yet_implemented]_
+
+  Here ``%p`` is the memory address of a data location. It might be presented as
+  the name of a global variable at that location.
+
+  Examples::
+
+    {{{data:0x12345678}}}
+    {{{data:0xffffffff9abcdef0}}}
+
+``{{{bt:%u:%p}}}``, ``{{{bt:%u:%p:ra}}}``, ``{{{bt:%u:%p:pc}}}`` [#not_yet_implemented]_
+
+  This represents one frame in a backtrace. It usually appears on a line by
+  itself (surrounded only by whitespace), in a sequence of such lines with
+  ascending frame numbers. So the human-readable output might be formatted
+  assuming that, such that it looks good for a sequence of bt elements each
+  alone on its line with uniform indentation of each line. But it can appear
+  anywhere, so the filter should not remove any non-whitespace text surrounding
+  the element.
+
+  Here ``%u`` is the frame number, which starts at zero for the location of the
+  fault being identified, increments to one for the caller of frame zero's call
+  frame, to two for the caller of frame one, etc. ``%p`` is the memory address
+  of a code location.
+
+  Code locations in a backtrace come from two distinct sources. Most backtrace
+  frames describe a return address code location, i.e. the instruction
+  immediately after a call instruction. This is the location of code that has
+  yet to run, since the function called there has not yet returned. Hence the
+  code location of actual interest is usually the call site itself rather than
+  the return address, i.e. one instruction earlier. When presenting the source
+  location for a return address frame, the symbolizing filter will subtract one
+  byte or one instruction length from the actual return address for the call
+  site, with the intent that the address logged can be translated directly to a
+  source location for the call site and not for the apparent return site
+  thereafter (which can be confusing).  When inlined functions are involved, the
+  call site and the return site can appear to be in different functions at
+  entirely unrelated source locations rather than just a line away, making the
+  confusion of showing the return site rather than the call site quite severe.
+
+  Often the first frame in a backtrace ("frame zero") identifies the precise
+  code location of a fault, trap, or asynchronous interrupt rather than a return
+  address. At other times, even the first frame is actually a return address
+  (for example, backtraces collected at the time of an object allocation and
+  reported later when the allocated object is used or misused). When a system
+  supports in-thread trap handling, there may also be frames after the first
+  that represent a precise interrupted code location rather than a return
+  address, presented as the "caller" of a trap handler function (for example,
+  signal handlers in POSIX systems).
+
+  Return address frames are identified by the ``:ra`` suffix. Precise code
+  location frames are identified by the ``:pc`` suffix.
+
+  Traditional practice has often been to collect backtraces as simple address
+  lists, losing the distinction between return address code locations and
+  precise code locations. Some such code applies the "subtract one" adjustment
+  described above to the address values before reporting them, and it's not
+  always clear or consistent whether this adjustment has been applied or not.
+  These ambiguous cases are supported by the ``bt`` and ``pc`` forms with no
+  ``:ra`` or ``:pc`` suffix, which indicate it's unclear which sort of code
+  location this is.  However, it's highly recommended that all emitters use the
+  suffixed forms and deliver address values with no adjustments applied. When
+  traditional practice has been ambiguous, the majority of cases seem to have
+  been of printing addresses that are return address code locations and printing
+  them without adjustment. So the symbolizing filter will usually apply the
+  "subtract one byte" adjustment to an address printed without a disambiguating
+  suffix. Assuming that a call instruction is longer than one byte on all
+  supported machines, applying the "subtract one byte" adjustment a second time
+  still results in an address somewhere in the call instruction, so a little
+  sloppiness here often does little or no harm.
+
+  Examples::
+
+    {{{bt:0:0x12345678:pc}}}
+    {{{bt:1:0xffffffff9abcdef0:ra}}}
+
+``{{{hexdict:...}}}`` [#not_yet_implemented]_
+
+  This element can span multiple lines. Here ``...`` is a sequence of key-value
+  pairs where a single ``:`` separates each key from its value, and arbitrary
+  whitespace separates the pairs. The value (right-hand side) of each pair
+  either is one or more ``0`` digits, or is ``0x`` followed by hexadecimal
+  digits. Each value might be a memory address or might be some other integer
+  (including an integer that looks like a likely memory address but actually has
+  an unrelated purpose). When the contextual information about the memory layout
+  suggests that a given value could be a code location or a global variable data
+  address, it might be presented as a source location or variable name or with
+  active UI that makes such interpretation optionally visible.
+
+  The intended use is for things like register dumps, where the emitter doesn't
+  know which values might have a symbolic interpretation but a presentation that
+  makes plausible symbolic interpretations available might be very useful to
+  someone reading the log. At the same time, a flat text presentation should
+  usually avoid interfering too much with the original contents and formatting
+  of the dump. For example, it might use footnotes with source locations for
+  values that appear to be code locations. An active UI presentation might show
+  the dump text as is, but highlight values with symbolic information available
+  and pop up a presentation of symbolic details when a value is selected.
+
+  Example::
+
+    {{{hexdict:
+        CS:                   0 RIP:     0x6ee17076fb80 EFL:            0x10246 CR2:                  0
+        RAX:      0xc53d0acbcf0 RBX:     0x1e659ea7e0d0 RCX:                  0 RDX:     0x6ee1708300cc
+        RSI:                  0 RDI:     0x6ee170830040 RBP:     0x3b13734898e0 RSP:     0x3b13734898d8
+        R8:      0x3b1373489860 R9:          0x2776ff4f R10:     0x2749d3e9a940 R11:              0x246
+        R12:     0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14:     0x1e659ea7e108 R15:      0xc53d0acbcf0
+      }}}
+
+Trigger elements
+================
+
+These elements cause an external action and will be presented to the user in a
+human-readable form. Generally they trigger an external action to occur that
+results in a linkable page. The link or other information about the external
+action can then be presented to the user.
+
+``{{{dumpfile:%s:%s}}}`` [#not_yet_implemented]_
+
+  Here the first ``%s`` is an identifier for a type of dump and the second
+  ``%s`` is an identifier for a particular dump that's just been published. The
+  types of dumps, the exact meaning of "published", and the nature of the
+  identifier are outside the scope of the markup format per se. In general it
+  might correspond to writing a file by that name or something similar.
+
+  This element may trigger additional post-processing work beyond symbolizing
+  the markup. It indicates that a dump file of some sort has been published.
+  Some logic attached to the symbolizing filter may understand certain types of
+  dump file and trigger additional post-processing of the dump file upon
+  encountering this element (e.g. generating visualizations, symbolization). The
+  expectation is that the information collected from contextual elements
+  (described below) in the logging stream may be necessary to decode the content
+  of the dump. So if the symbolizing filter triggers other processing, it may
+  need to feed some distilled form of the contextual information to those
+  processes.
+
+  An example of a type identifier is ``sancov``, for dumps from LLVM
+  `SanitizerCoverage <https://clang.llvm.org/docs/SanitizerCoverage.html>`_.
+
+  Example::
+
+    {{{dumpfile:sancov:sancov.8675}}}
+
+Contextual elements
+===================
+
+These are elements that supply information necessary to convert presentation
+elements to symbolic form. Unlike presentation elements, they are not directly
+related to the surrounding text. Contextual elements should appear alone on
+lines with no other non-whitespace text, so that the symbolizing filter might
+elide the whole line from its output without hiding any other log text.
+
+The contextual elements themselves do not necessarily need to be presented in
+human-readable output. However, the information they impart may be essential to
+understanding the logging text even after symbolization. So it's recommended
+that this information be preserved in some form when the original raw log with
+markup may no longer be readily accessible for whatever reason.
+
+Contextual elements should appear in the logging stream before they are needed.
+That is, if some piece of context may affect how the symbolizing filter would
+interpret or present a later presentation element, the necessary contextual
+elements should have appeared somewhere earlier in the logging stream. It should
+always be possible for the symbolizing filter to be implemented as a single pass
+over the raw logging stream, accumulating context and massaging text as it goes.
+
+``{{{reset}}}`` [#not_yet_implemented]_
+
+  This should be output before any other contextual element. It is needed to
+  support implementations that handle logs coming from
+  multiple processes. Such implementations might not know when a new process
+  starts or ends. Because some identifying information (like process IDs) might
+  be the same between old and new processes, a way is needed to distinguish two
+  processes with such identical identifying information. This element informs
+  such implementations to reset the state of a filter so that information from a
+  previous process's contextual elements is not assumed for a new process that
+  just happens to have the same identifying information.
+
+``{{{module:%i:%s:%s:...}}}`` [#not_yet_implemented]_
+
+  This element represents a so-called "module". A "module" is a single linked
+  binary, such as a loaded ELF file. Usually each module occupies a contiguous
+  range of memory.
+
+  Here ``%i`` is the module ID which is used by other contextual elements to
+  refer to this module. The first ``%s`` is a human-readable identifier for the
+  module, such as an ELF ``DT_SONAME`` string or a file name; but it might be
+  empty. It's only for casual information. Only the module ID is used to refer
+  to this module in other contextual elements, never the ``%s`` string. The
+  ``module`` element defining a module ID must always be emitted before any
+  other elements that refer to that module ID, so that a filter never needs to
+  keep track of dangling references. The second ``%s`` is the module type and it
+  determines what the remaining fields are. The following module types are
+  supported:
+
+  * ``elf:%x``
+
+  Here ``%x`` encodes an ELF Build ID. The Build ID should refer to a single
+  linked binary. The Build ID string is the sole way to identify the binary from
+  which this module was loaded.
+
+  Example::
+
+    {{{module:1:libc.so:elf:83238ab56ba10497}}}
+
+``{{{mmap:%p:%i:...}}}`` [#not_yet_implemented]_
+
+  This contextual element is used to give information about a particular region
+  in memory. ``%p`` is the starting address and ``%i`` gives the size in hex of the
+  region of memory. The ``...`` part can take different forms to give different
+  information about the specified region of memory. The allowed forms are the
+  following:
+
+  * ``load:%i:%s:%p``
+
+  This subelement informs the filter that a segment was loaded from a module.
+  The module is identified by its module ID ``%i``. The ``%s`` is one or more of
+  the letters 'r', 'w', and 'x' (in that order and in either upper or lower
+  case) to indicate this segment of memory is readable, writable, and/or
+  executable. The symbolizing filter can use this information to guess whether
+  an address is a likely code address or a likely data address in the given
+  module. The remaining ``%p`` gives the module relative address. For ELF files
+  the module relative address will be the ``p_vaddr`` of the associated program
+  header. For example if your module's executable segment has
+  ``p_vaddr=0x1000``, ``p_memsz=0x1234``, and was loaded at ``0x7acba69d5000``
+  then you need to subtract ``0x7acba69d4000`` from any address between
+  ``0x7acba69d5000`` and ``0x7acba69d6234`` to get the module relative address.
+  The starting address will usually have been rounded down to the active page
+  size, and the size rounded up.
+
+  Example::
+
+    {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
+
+.. rubric:: Footnotes
+
+.. [#not_yet_implemented] This markup element is not yet implemented in
+  :manpage:`llvm-symbolizer(1)`.
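
The module-relative address arithmetic described for the ``load`` form of
``mmap`` can be checked with a small sketch. The LoadMapping struct, the
ExampleMapping constant, and both function names are hypothetical and are
not part of MarkupFilter or the Symbolize library; the numbers come from
the ``{{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}`` example in the
document.

```cpp
#include <cstdint>

// Hypothetical record holding the numeric fields of one
// {{{mmap:%p:%i:load:%i:%s:%p}}} element (the rwx flags are omitted).
struct LoadMapping {
  std::uint64_t Start;               // First %p: runtime start address.
  std::uint64_t Size;                // First %i: size of the region.
  std::uint64_t ModuleId;            // Second %i: module ID.
  std::uint64_t ModuleRelativeStart; // Trailing %p: module relative address.
};

// Values taken from the mmap example in the markup format document.
const LoadMapping ExampleMapping = {0x7acba69d5000, 0x5a000, 1, 0x1000};

// True if Addr lies inside the mapped region [Start, Start + Size).
bool containsAddress(const LoadMapping &M, std::uint64_t Addr) {
  return Addr >= M.Start && Addr - M.Start < M.Size;
}

// Converts a runtime address to a module relative address.
// Precondition: containsAddress(M, Addr).
std::uint64_t toModuleRelative(const LoadMapping &M, std::uint64_t Addr) {
  // Same as subtracting (Start - ModuleRelativeStart), i.e. 0x7acba69d4000
  // for ExampleMapping.
  return Addr - M.Start + M.ModuleRelativeStart;
}
```

With ExampleMapping, the runtime address 0x7acba69d5234 maps to the module
relative address 0x1234, matching the "subtract 0x7acba69d4000" rule given
in the text.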

diff --git a/llvm/include/llvm/DebugInfo/Symbolize/Markup.h b/llvm/include/llvm/DebugInfo/Symbolize/Markup.h
index 86c133dd66adf..2628b47cf6d3e 100644
--- a/llvm/include/llvm/DebugInfo/Symbolize/Markup.h
+++ b/llvm/include/llvm/DebugInfo/Symbolize/Markup.h
@@ -9,7 +9,7 @@
 /// \file
 /// This file declares the log symbolizer markup data model and parser.
 ///
-/// \todo Add a link to the reference documentation once added.
+/// See https://llvm.org/docs/SymbolizerMarkupFormat.html
 ///
 //===----------------------------------------------------------------------===//
 

diff --git a/llvm/include/llvm/DebugInfo/Symbolize/MarkupFilter.h b/llvm/include/llvm/DebugInfo/Symbolize/MarkupFilter.h
new file mode 100644
index 0000000000000..b7d70ccafe66d
--- /dev/null
+++ b/llvm/include/llvm/DebugInfo/Symbolize/MarkupFilter.h
@@ -0,0 +1,76 @@
+//===- MarkupFilter.h -------------------------------------------*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file declares a filter that replaces symbolizer markup with
+/// human-readable expressions.
+///
+//===----------------------------------------------------------------------===//
+
+#ifndef LLVM_DEBUGINFO_SYMBOLIZE_MARKUPFILTER_H
+#define LLVM_DEBUGINFO_SYMBOLIZE_MARKUPFILTER_H
+
+#include "Markup.h"
+
+#include "llvm/Support/WithColor.h"
+#include "llvm/Support/raw_ostream.h"
+
+namespace llvm {
+namespace symbolize {
+
+/// Filter to convert parsed log symbolizer markup elements into human-readable
+/// text.
+class MarkupFilter {
+public:
+  MarkupFilter(raw_ostream &OS, Optional<bool> ColorsEnabled = llvm::None);
+
+  /// Begins a logical \p Line of markup.
+  ///
+  /// This must be called for each line of the input stream before calls to
+  /// filter() for elements of that line. The provided \p Line must be the same
+  /// one that was passed to parseLine() to produce the elements to be later
+  /// passed to filter().
+  ///
+  /// This informs the filter that a new line is beginning and establishes a
+  /// context for error location reporting.
+  void beginLine(StringRef Line);
+
+  /// Handle a \p Node of symbolizer markup.
+  ///
+  /// If the node is a recognized, valid markup element, it is replaced with a
+  /// human-readable string. If the node isn't an element or the element isn't
+  /// recognized, it is output verbatim. If the element is recognized but isn't
+  /// valid, it is omitted from the output.
+  void filter(const MarkupNode &Node);
+
+private:
+  bool trySGR(const MarkupNode &Node);
+
+  void highlight();
+  void restoreColor();
+  void resetColor();
+
+  bool checkTag(const MarkupNode &Node) const;
+  bool checkNumFields(const MarkupNode &Node, size_t Size) const;
+
+  void reportTypeError(StringRef Str, StringRef TypeName) const;
+  void reportLocation(StringRef::iterator Loc) const;
+
+  raw_ostream &OS;
+  const bool ColorsEnabled;
+
+  StringRef Line;
+
+  Optional<raw_ostream::Colors> Color;
+  bool Bold = false;
+};
+
+} // end namespace symbolize
+} // end namespace llvm
+
+#endif // LLVM_DEBUGINFO_SYMBOLIZE_MARKUPFILTER_H

diff --git a/llvm/lib/DebugInfo/Symbolize/CMakeLists.txt b/llvm/lib/DebugInfo/Symbolize/CMakeLists.txt
index c83d957eeb9d5..47cb4243ef9a2 100644
--- a/llvm/lib/DebugInfo/Symbolize/CMakeLists.txt
+++ b/llvm/lib/DebugInfo/Symbolize/CMakeLists.txt
@@ -2,6 +2,7 @@ add_llvm_component_library(LLVMSymbolize
   DIFetcher.cpp
   DIPrinter.cpp
   Markup.cpp
+  MarkupFilter.cpp
   SymbolizableObjectFile.cpp
   Symbolize.cpp
 

diff --git a/llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp b/llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp
new file mode 100644
index 0000000000000..42719ddbef4c6
--- /dev/null
+++ b/llvm/lib/DebugInfo/Symbolize/MarkupFilter.cpp
@@ -0,0 +1,143 @@
+//===-- lib/DebugInfo/Symbolize/MarkupFilter.cpp -------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+///
+/// \file
+/// This file defines the implementation of a filter that replaces symbolizer
+/// markup with human-readable expressions.
+///
+//===----------------------------------------------------------------------===//
+
+#include "llvm/DebugInfo/Symbolize/MarkupFilter.h"
+
+#include "llvm/ADT/None.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/StringSwitch.h"
+#include "llvm/Demangle/Demangle.h"
+#include "llvm/Support/WithColor.h"
+#include "llvm/Support/raw_ostream.h"
+
+using namespace llvm;
+using namespace llvm::symbolize;
+
+MarkupFilter::MarkupFilter(raw_ostream &OS, Optional<bool> ColorsEnabled)
+    : OS(OS), ColorsEnabled(ColorsEnabled.getValueOr(
+                  WithColor::defaultAutoDetectFunction()(OS))) {}
+
+void MarkupFilter::beginLine(StringRef Line) {
+  this->Line = Line;
+  resetColor();
+}
+
+void MarkupFilter::filter(const MarkupNode &Node) {
+  if (!checkTag(Node))
+    return;
+
+  if (trySGR(Node))
+    return;
+
+  if (Node.Tag == "symbol") {
+    if (!checkNumFields(Node, 1))
+      return;
+    highlight();
+    OS << llvm::demangle(Node.Fields.front().str());
+    restoreColor();
+    return;
+  }
+
+  OS << Node.Text;
+}
+
+bool MarkupFilter::trySGR(const MarkupNode &Node) {
+  if (Node.Text == "\033[0m") {
+    resetColor();
+    return true;
+  }
+  if (Node.Text == "\033[1m") {
+    Bold = true;
+    if (ColorsEnabled)
+      OS.changeColor(raw_ostream::Colors::SAVEDCOLOR, Bold);
+    return true;
+  }
+  auto SGRColor = StringSwitch<Optional<raw_ostream::Colors>>(Node.Text)
+                      .Case("\033[30m", raw_ostream::Colors::BLACK)
+                      .Case("\033[31m", raw_ostream::Colors::RED)
+                      .Case("\033[32m", raw_ostream::Colors::GREEN)
+                      .Case("\033[33m", raw_ostream::Colors::YELLOW)
+                      .Case("\033[34m", raw_ostream::Colors::BLUE)
+                      .Case("\033[35m", raw_ostream::Colors::MAGENTA)
+                      .Case("\033[36m", raw_ostream::Colors::CYAN)
+                      .Case("\033[37m", raw_ostream::Colors::WHITE)
+                      .Default(llvm::None);
+  if (SGRColor) {
+    Color = *SGRColor;
+    if (ColorsEnabled)
+      OS.changeColor(*Color);
+    return true;
+  }
+
+  return false;
+}
+
+// Begin highlighting text by picking a different color than the current color
+// state.
+void MarkupFilter::highlight() {
+  if (!ColorsEnabled)
+    return;
+  OS.changeColor(Color == raw_ostream::Colors::BLUE ? raw_ostream::Colors::CYAN
+                                                    : raw_ostream::Colors::BLUE,
+                 Bold);
+}
+
+// Set the output stream's color to the current color and bold state of the SGR
+// abstract machine.
+void MarkupFilter::restoreColor() {
+  if (!ColorsEnabled)
+    return;
+  if (Color) {
+    OS.changeColor(*Color, Bold);
+  } else {
+    OS.resetColor();
+    if (Bold)
+      OS.changeColor(raw_ostream::Colors::SAVEDCOLOR, Bold);
+  }
+}
+
+// Set the SGR and output stream's color and bold states back to the default.
+void MarkupFilter::resetColor() {
+  if (!Color && !Bold)
+    return;
+  Color.reset();
+  Bold = false;
+  if (ColorsEnabled)
+    OS.resetColor();
+}
+
+bool MarkupFilter::checkTag(const MarkupNode &Node) const {
+  if (any_of(Node.Tag, [](char C) { return C < 'a' || C > 'z'; })) {
+    WithColor::error(errs()) << "tags must be all lowercase characters\n";
+    reportLocation(Node.Tag.begin());
+    return false;
+  }
+  return true;
+}
+
+bool MarkupFilter::checkNumFields(const MarkupNode &Node, size_t Size) const {
+  if (Node.Fields.size() != Size) {
+    WithColor::error(errs()) << "expected " << Size << " fields; found "
+                             << Node.Fields.size() << "\n";
+    reportLocation(Node.Tag.end());
+    return false;
+  }
+  return true;
+}
+
+void MarkupFilter::reportLocation(StringRef::iterator Loc) const {
+  errs() << Line;
+  WithColor(errs().indent(Loc - Line.begin()), HighlightColor::String) << '^';
+  errs() << '\n';
+}
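
The color handling above amounts to a small state machine: SGR sequences update a tracked color and bold flag, and highlight() picks blue unless the tracked color is already blue, in which case it picks cyan. A minimal standalone sketch of that state tracking follows. `SgrState`, `applySgr` and `pickHighlight` are hypothetical names, and the sketch only models the state; it does not write escape codes to a stream.

```cpp
#include <optional>
#include <string>

enum class Color { Black, Red, Green, Yellow, Blue, Magenta, Cyan, White };

// Hypothetical model of the filter's tracked SGR state.
struct SgrState {
  std::optional<Color> Current; // empty means the terminal default color
  bool Bold = false;
};

// Maps "\033[30m" .. "\033[37m" to a color; anything else yields nullopt.
std::optional<Color> parseSgrColor(const std::string &Text) {
  if (Text.size() != 5 || Text.compare(0, 3, "\033[3") != 0 || Text[4] != 'm')
    return std::nullopt;
  char Digit = Text[3];
  if (Digit < '0' || Digit > '7')
    return std::nullopt;
  return static_cast<Color>(Digit - '0');
}

// Applies one node's text to the state. Returns true if it was an SGR code.
bool applySgr(SgrState &S, const std::string &Text) {
  if (Text == "\033[0m") { // reset clears both color and bold
    S = SgrState{};
    return true;
  }
  if (Text == "\033[1m") { // bold keeps the current color
    S.Bold = true;
    return true;
  }
  if (std::optional<Color> C = parseSgrColor(Text)) {
    S.Current = *C;
    return true;
  }
  return false;
}

// Chooses a highlight color that differs from the tracked color.
Color pickHighlight(const SgrState &S) {
  return S.Current == Color::Blue ? Color::Cyan : Color::Blue;
}
```

Keeping the state separate from the stream is what lets the filter restore the surrounding color after a highlighted symbol.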

diff --git a/llvm/test/DebugInfo/symbolize-filter-markup-color.test b/llvm/test/DebugInfo/symbolize-filter-markup-color.test
new file mode 100644
index 0000000000000..49f50fbc1ae75
--- /dev/null
+++ b/llvm/test/DebugInfo/symbolize-filter-markup-color.test
@@ -0,0 +1,31 @@
+RUN: echo -e "\033[1mbold\033[0mreset" > %t.input
+RUN: echo -e "\033[1mboldnoreset" >> %t.input
+RUN: echo -e "resetafternewline" >> %t.input
+RUN: echo -e "\033[30mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[31mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[32mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[33mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[34mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[35mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[36mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[37mcolor\033[0m" >> %t.input
+RUN: echo -e "\033[33mbefore{{{symbol:highlight}}}after\033[0m" >> %t.input
+RUN: echo -e "\033[34msame{{{symbol:highlight}}}after\033[0m" >> %t.input
+RUN: echo -e "\033[1mbold{{{symbol:highlight}}}after\033[0m" >> %t.input
+RUN: llvm-symbolizer --filter-markup --color=always < %t.input > %t.output
+RUN: FileCheck %s --input-file=%t.output --match-full-lines --implicit-check-not {{.}}
+
+CHECK: {{.}}[1mbold{{.}}[0mreset
+CHECK: {{.}}[1mboldnoreset
+CHECK: {{.}}[0mresetafternewline
+CHECK: {{.}}[0;30mcolor{{.}}[0m
+CHECK: {{.}}[0;31mcolor{{.}}[0m
+CHECK: {{.}}[0;32mcolor{{.}}[0m
+CHECK: {{.}}[0;33mcolor{{.}}[0m
+CHECK: {{.}}[0;34mcolor{{.}}[0m
+CHECK: {{.}}[0;35mcolor{{.}}[0m
+CHECK: {{.}}[0;36mcolor{{.}}[0m
+CHECK: {{.}}[0;37mcolor{{.}}[0m
+CHECK: {{.}}[0;33mbefore{{.}}[0;34mhighlight{{.}}[0;33mafter{{.}}[0m
+CHECK: {{.}}[0;34msame{{.}}[0;36mhighlight{{.}}[0;34mafter{{.}}[0m
+CHECK: {{.}}[1mbold{{.}}[0;1;34mhighlight{{.}}[0m{{.}}[1mafter{{.}}[0m

diff --git a/llvm/test/DebugInfo/symbolize-filter-markup-error-location.test b/llvm/test/DebugInfo/symbolize-filter-markup-error-location.test
new file mode 100644
index 0000000000000..4d05bfd39ca99
--- /dev/null
+++ b/llvm/test/DebugInfo/symbolize-filter-markup-error-location.test
@@ -0,0 +1,17 @@
+RUN: split-file %s %t
+RUN: llvm-symbolizer --debug-file-directory=%p/Inputs --filter-markup < %t/log > /dev/null 2> %t.err
+RUN: FileCheck %s -input-file=%t.err --match-full-lines --strict-whitespace
+
+CHECK:error: expected 1 fields; found 0
+CHECK:[[BEGIN:[{]{3}]]symbol[[END:[}]{3}]]
+CHECK:         ^
+CHECK:error: expected 1 fields; found 0
+CHECK:foo[[BEGIN]]symbol[[END]]bar[[BEGIN]]symbol[[END]]baz
+CHECK:            ^
+CHECK:error: expected 1 fields; found 0
+CHECK:foo[[BEGIN]]symbol[[END]]bar[[BEGIN]]symbol[[END]]baz
+CHECK:                           ^
+
+;--- log
+{{{symbol}}}
+foo{{{symbol}}}bar{{{symbol}}}baz
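
The caret columns checked above follow from reportLocation(): the caret is indented by the offset of the reported position within the line, and checkNumFields reports the end of the tag. A minimal standalone sketch of that arithmetic follows. `tagEndOffset` and `caretLine` are hypothetical helpers, and the tag search is a simplification of what the real parser does.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Offset just past the tag of the first "{{{" element found at or after From,
// or npos if there is none. Assumes the tag ends at the first ':' or '}'.
std::size_t tagEndOffset(std::string_view Line, std::size_t From = 0) {
  std::size_t Open = Line.find("{{{", From);
  if (Open == std::string_view::npos)
    return std::string_view::npos;
  std::size_t End = Line.find_first_of(":}", Open + 3);
  return End == std::string_view::npos ? Line.size() : End;
}

// Builds the marker line printed under the offending input line.
std::string caretLine(std::size_t Offset) {
  return std::string(Offset, ' ') + '^';
}
```

For `{{{symbol}}}` the tag ends at offset 9, and for the two elements in `foo{{{symbol}}}bar{{{symbol}}}baz` it ends at offsets 12 and 27, which matches the indentation in the CHECK lines.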

diff --git a/llvm/test/DebugInfo/symbolize-filter-markup-symbol.test b/llvm/test/DebugInfo/symbolize-filter-markup-symbol.test
new file mode 100644
index 0000000000000..9c1ed5e46e01b
--- /dev/null
+++ b/llvm/test/DebugInfo/symbolize-filter-markup-symbol.test
@@ -0,0 +1,10 @@
+RUN: split-file %s %t
+RUN: llvm-symbolizer --filter-markup < %t/input > %t.output
+RUN: FileCheck %s --input-file=%t.output --match-full-lines --implicit-check-not {{.}}
+
+CHECK: foo
+CHECK: Mangled::Name()
+
+;--- input
+{{{symbol:foo}}}
+{{{symbol:_ZN7Mangled4NameEv}}}
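
The two inputs above show the behavior of the symbol element: a name that is not mangled is printed as-is, and an Itanium-mangled name is printed demangled. The commit uses llvm::demangle for this. The sketch below approximates the same behavior outside LLVM with `abi::__cxa_demangle` from `<cxxabi.h>`, which is an assumption about the toolchain (GCC or Clang with an Itanium C++ ABI runtime). `demangleOrSelf` is a hypothetical name, and unlike llvm::demangle it handles only names that start with `_Z`.

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of Name, or Name itself if it is not an
// Itanium-mangled name or cannot be demangled.
std::string demangleOrSelf(const std::string &Name) {
  // Itanium-mangled names start with "_Z"; leave everything else alone so
  // that short names are not misread as encoded types.
  if (Name.rfind("_Z", 0) != 0)
    return Name;
  int Status = 0;
  char *Buf = abi::__cxa_demangle(Name.c_str(), nullptr, nullptr, &Status);
  if (Status != 0 || Buf == nullptr)
    return Name;
  std::string Result(Buf);
  std::free(Buf); // the buffer is allocated with malloc by the runtime
  return Result;
}
```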

diff --git a/llvm/test/DebugInfo/symbolize-filter-markup-tag.test b/llvm/test/DebugInfo/symbolize-filter-markup-tag.test
new file mode 100644
index 0000000000000..36aefc323c02c
--- /dev/null
+++ b/llvm/test/DebugInfo/symbolize-filter-markup-tag.test
@@ -0,0 +1,10 @@
+RUN: split-file %s %t
+RUN: llvm-symbolizer --filter-markup < %t/input 2> %t.error
+RUN: FileCheck %s --input-file=%t.error --match-full-lines
+
+CHECK: error: tags must be all lowercase characters
+CHECK: error: tags must be all lowercase characters
+
+;--- input
+{{{t2g}}}
+{{{tAg}}}

diff --git a/llvm/test/tools/llvm-symbolizer/filter-markup.test b/llvm/test/tools/llvm-symbolizer/filter-markup.test
new file mode 100644
index 0000000000000..4610994b40ac1
--- /dev/null
+++ b/llvm/test/tools/llvm-symbolizer/filter-markup.test
@@ -0,0 +1,21 @@
+RUN: echo -e "a{{{symbol:foo}}}b\n{{{symbol:bar}}}\n" > %t.input
+RUN: llvm-symbolizer --filter-markup < %t.input > %t.nocolor
+RUN: FileCheck %s --check-prefix=NOCOLOR --input-file=%t.nocolor --match-full-lines --implicit-check-not {{.}}
+
+NOCOLOR: afoob
+NOCOLOR: bar
+
+RUN: llvm-symbolizer --filter-markup --color < %t.input > %t.color
+RUN: FileCheck %s --check-prefix=COLOR --input-file=%t.color --match-full-lines --implicit-check-not {{.}}
+
+RUN: llvm-symbolizer --filter-markup --color=auto < %t.input > %t.autocolor
+RUN: FileCheck %s --check-prefix=NOCOLOR --input-file=%t.autocolor --match-full-lines --implicit-check-not {{.}}
+
+RUN: llvm-symbolizer --filter-markup --color=never < %t.input > %t.nevercolor
+RUN: FileCheck %s --check-prefix=NOCOLOR --input-file=%t.nevercolor --match-full-lines --implicit-check-not {{.}}
+
+RUN: llvm-symbolizer --filter-markup --color=always < %t.input > %t.alwayscolor
+RUN: FileCheck %s --check-prefix=COLOR --input-file=%t.alwayscolor --match-full-lines --implicit-check-not {{.}}
+
+COLOR: a{{.}}[0;34mfoo{{.}}[0mb
+COLOR: {{.}}[0;34mbar{{.}}[0m

diff --git a/llvm/tools/llvm-symbolizer/Opts.td b/llvm/tools/llvm-symbolizer/Opts.td
index dae1bd611fdd8..6742e086d6ff9 100644
--- a/llvm/tools/llvm-symbolizer/Opts.td
+++ b/llvm/tools/llvm-symbolizer/Opts.td
@@ -23,12 +23,15 @@ defm adjust_vma
 def basenames : Flag<["--"], "basenames">, HelpText<"Strip directory names from paths">;
 defm build_id : Eq<"build-id", "Build ID used to look up the object file">;
 defm cache_size : Eq<"cache-size", "Max size in bytes of the in-memory binary cache.">;
+def color : F<"color", "Use color when symbolizing log markup.">;
+def color_EQ : Joined<["--"], "color=">, HelpText<"Whether to use color when symbolizing log markup: always, auto, never">, Values<"always,auto,never">;
 defm debug_file_directory : Eq<"debug-file-directory", "Path to directory where to look for debug files">, MetaVarName<"<dir>">;
 defm debuginfod : B<"debuginfod", "Use debuginfod to find debug binaries", "Don't use debuginfod to find debug binaries">;
 defm default_arch
     : Eq<"default-arch", "Default architecture (for multi-arch objects)">,
       Group<grp_mach_o>;
 defm demangle : B<"demangle", "Demangle function names", "Don't demangle function names">;
+def filter_markup : Flag<["--"], "filter-markup">, HelpText<"Filter symbolizer markup from stdin.">;
 def functions : F<"functions", "Print function name for a given address">;
 def functions_EQ : Joined<["--"], "functions=">, HelpText<"Print function name for a given address">, Values<"none,short,linkage">;
 def help : F<"help", "Display this help">;

diff --git a/llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp b/llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
index c9792788ae6c0..b782c7a1720ab 100644
--- a/llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
+++ b/llvm/tools/llvm-symbolizer/llvm-symbolizer.cpp
@@ -19,6 +19,8 @@
 #include "llvm/ADT/StringRef.h"
 #include "llvm/Config/config.h"
 #include "llvm/DebugInfo/Symbolize/DIPrinter.h"
+#include "llvm/DebugInfo/Symbolize/Markup.h"
+#include "llvm/DebugInfo/Symbolize/MarkupFilter.h"
 #include "llvm/DebugInfo/Symbolize/SymbolizableModule.h"
 #include "llvm/DebugInfo/Symbolize/Symbolize.h"
 #include "llvm/Debuginfod/DIFetcher.h"
@@ -337,6 +339,17 @@ static FunctionNameKind decideHowToPrintFunctions(const opt::InputArgList &Args,
   return IsAddr2Line ? FunctionNameKind::None : FunctionNameKind::LinkageName;
 }
 
+static Optional<bool> parseColorArg(const opt::InputArgList &Args) {
+  if (Args.hasArg(OPT_color))
+    return true;
+  if (const opt::Arg *A = Args.getLastArg(OPT_color_EQ))
+    return StringSwitch<Optional<bool>>(A->getValue())
+        .Case("always", true)
+        .Case("never", false)
+        .Case("auto", None);
+  return None;
+}
+
 static SmallVector<uint8_t> parseBuildIDArg(const opt::InputArgList &Args,
                                             int ID) {
   const opt::Arg *A = Args.getLastArg(ID);
@@ -352,6 +365,22 @@ static SmallVector<uint8_t> parseBuildIDArg(const opt::InputArgList &Args,
   return BuildID;
 }
 
+// Symbolize the markup from stdin and write the result to stdout.
+static void filterMarkup(const opt::InputArgList &Args) {
+  MarkupParser Parser;
+  MarkupFilter Filter(outs(), parseColorArg(Args));
+  for (std::string InputString; std::getline(std::cin, InputString);) {
+    InputString += '\n';
+    Parser.parseLine(InputString);
+    Filter.beginLine(InputString);
+    while (Optional<MarkupNode> Element = Parser.nextNode())
+      Filter.filter(*Element);
+  }
+  Parser.flush();
+  while (Optional<MarkupNode> Element = Parser.nextNode())
+    Filter.filter(*Element);
+}
+
 ExitOnError ExitOnErr;
 
 int main(int argc, char **argv) {
@@ -413,6 +442,11 @@ int main(int argc, char **argv) {
     }
   }
 
+  if (Args.hasArg(OPT_filter_markup)) {
+    filterMarkup(Args);
+    return 0;
+  }
+
   auto Style = IsAddr2Line ? OutputStyle::GNU : OutputStyle::LLVM;
   if (const opt::Arg *A = Args.getLastArg(OPT_output_style_EQ)) {
     if (strcmp(A->getValue(), "GNU") == 0)
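
The --color handling in parseColorArg and the MarkupFilter constructor is a tri-state: an explicit "always" or "never" wins, and "auto" or no option defers to stream auto-detection. A minimal standalone sketch of that resolution follows, using `std::optional<bool>` in place of `llvm::Optional<bool>`. `parseColorValue` and `resolveColors` are hypothetical names, and `StreamSupportsColor` stands in for the result of the auto-detect function.

```cpp
#include <optional>
#include <string_view>

// Maps the value of --color= to an explicit choice, or nullopt for "auto".
// Assumed simplification: any unrecognized value is also treated as "auto".
std::optional<bool> parseColorValue(std::string_view Value) {
  if (Value == "always")
    return true;
  if (Value == "never")
    return false;
  return std::nullopt;
}

// An explicit request wins; otherwise fall back to what the stream supports.
bool resolveColors(std::optional<bool> Requested, bool StreamSupportsColor) {
  return Requested.value_or(StreamSupportsColor);
}
```

This is why the tests that redirect stdout to a file expect uncolored output under --color=auto: the fallback reports that the stream does not support color.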


        

