[LLVMdev] DataFlowSanitizer design discussion

Peter Collingbourne peter at pcc.me.uk
Tue Aug 6 17:55:13 PDT 2013


Hi,

If there are no further comments on the design below I intend to commit
my DFSan patches in a week.

Thanks,
Peter

On Tue, Jun 25, 2013 at 06:13:49PM -0700, Peter Collingbourne wrote:
> On Thu, Jun 13, 2013 at 03:00:46PM -0700, Peter Collingbourne wrote:
> > Hi,
> > 
> > I am starting a thread to discuss the design of DataFlowSanitizer,
> > a compiler instrumentation based analysis tool which I am hoping to
> > bring into LLVM.  As a starting point, I have included the current
> > version of the design document below.  Comments are appreciated.
> 
> Any further comments on the below?  I've updated the design document
> to add a use case at Kostya's request, but I'd appreciate any further
> review of the design.
> 
> Thanks,
> Peter
> 
> 
> DataFlowSanitizer Design Document
> *********************************
> 
> This document sets out the design for DataFlowSanitizer, a general
> dynamic data flow analysis.  Unlike other Sanitizer tools, this tool
> is not designed to detect a specific class of bugs on its own.
> Instead, it provides a generic dynamic data flow analysis framework to
> be used by clients to help detect application-specific issues within
> their own code.
> 
> DataFlowSanitizer is a program instrumentation which can associate a
> number of taint labels with any data stored in any memory region
> accessible by the program. The analysis is dynamic, which means that
> it operates on a running program, and tracks how the labels propagate
> through that program. The tool shall support a large (>100) number of
> labels, such that programs which operate on large numbers of data
> items may be analysed with each data item being tracked separately.
> 
> 
> Use Cases
> =========
> 
> This instrumentation can be used as a tool to help monitor how data
> flows from a program's inputs (sources) to its outputs (sinks). This
> has applications from a privacy/security perspective in that one can
> audit how a sensitive data item is used within a program and ensure it
> isn't exiting the program anywhere it shouldn't be.
> 
> 
> Interface
> =========
> 
> A number of functions are provided which will create taint labels,
> attach labels to memory regions and extract the set of labels
> associated with a specific memory region. These functions are declared
> in the header file "sanitizer/dfsan_interface.h".
> 
>    /// Creates and returns a base label with the given description and user data.
>    dfsan_label dfsan_create_label(const char *desc, void *userdata);
> 
>    /// Sets the label for each address in [addr,addr+size) to \c label.
>    void dfsan_set_label(dfsan_label label, void *addr, size_t size);
> 
>    /// Sets the label for each address in [addr,addr+size) to the union of the
>    /// current label for that address and \c label.
>    void dfsan_add_label(dfsan_label label, void *addr, size_t size);
> 
>    /// Retrieves the label associated with the given data.
>    ///
>    /// The type of 'data' is arbitrary.  The function accepts a value of any type,
>    /// which can be truncated or extended (implicitly or explicitly) as necessary.
>    /// The truncation/extension operations will preserve the label of the original
>    /// value.
>    dfsan_label dfsan_get_label(long data);
> 
>    /// Retrieves a pointer to the dfsan_label_info struct for the given label.
>    const struct dfsan_label_info *dfsan_get_label_info(dfsan_label label);
> 
>    /// Returns whether the given label label contains the label elem.
>    int dfsan_has_label(dfsan_label label, dfsan_label elem);
> 
>    /// If the given label label contains a label with the description desc, returns
>    /// that label, else returns 0.
>    dfsan_label dfsan_has_label_with_desc(dfsan_label label, const char *desc);
> 
> 
> Taint label representation
> ==========================
> 
> As stated above, the tool must track a large number of taint labels.
> This poses an implementation challenge, as most multiple-label
> tainting systems assign one label per bit to shadow storage, and union
> taint labels using a bitwise or operation. This will not scale to
> clients which use hundreds or thousands of taint labels, as the label
> union operation becomes O(n) in the number of supported labels, and
> data associated with it will quickly dominate the live variable set,
> causing register spills and hampering performance.
> 
> Instead, a low overhead approach is proposed which is best-case
> O(log_2 n) during execution. The underlying assumption is that the
> required space of label unions is sparse, which is a reasonable
> assumption to make given that we are optimizing for the case where
> applications mostly copy data from one place to another, without often
> invoking the need for an actual union operation. The representation of
> a taint label is a 16-bit integer, and new labels are allocated
> sequentially from a pool. The label identifier 0 is special, and means
> that the data item is unlabelled.
> 
> When a label union operation is requested at a join point (any
> arithmetic or logical operation with two or more operands, such as
> addition), the code checks whether a union is required, whether the
> same union has been requested before, and whether one union label
> subsumes the other. If so, it returns the previously allocated union
> label. If not, it allocates a new union label from the same pool used
> for new labels.
> 
> Specifically, the instrumentation pass will insert code like this to
> decide the union label "lu" for a pair of labels "l1" and "l2":
> 
>    if (l1 == l2)
>      lu = l1;
>    else
>      lu = __dfsan_union(l1, l2);
> 
> The equality comparison is outlined, to provide an early exit in the
> common cases where the program is processing unlabelled data, or where
> the two data items have the same label.  "__dfsan_union" is a runtime
> library function which performs all other union computation.
> 
> Further optimizations are possible, for example if "l1" is known at
> compile time to be zero (e.g. it is derived from a constant), "l2" can
> be used for "lu", and vice versa.
> 
> 
> Memory layout and label management
> ==================================
> 
> The following is the current memory layout for Linux/x86_64:
> 
> +-----------------+-----------------+----------------------+
> | Start           | End             | Use                  |
> +=================+=================+======================+
> | 0x700000008000  | 0x800000000000  | application memory   |
> +-----------------+-----------------+----------------------+
> | 0x200200000000  | 0x700000008000  | unused               |
> +-----------------+-----------------+----------------------+
> | 0x200000000000  | 0x200200000000  | union table          |
> +-----------------+-----------------+----------------------+
> | 0x000000010000  | 0x200000000000  | shadow memory        |
> +-----------------+-----------------+----------------------+
> | 0x000000000000  | 0x000000010000  | reserved by kernel   |
> +-----------------+-----------------+----------------------+
> 
> Each byte of application memory corresponds to two bytes of shadow
> memory, which are used to store its taint label. As for LLVM SSA
> registers, we have not found it necessary to associate a label with
> each byte or bit of data, as some other tools do. Instead, labels are
> associated directly with registers.  Loads will result in a union of
> all shadow labels corresponding to bytes loaded (which most of the
> time will be short circuited by the initial comparison) and stores
> will result in a copy of the label to the shadow of all bytes stored
> to.
> 
> -- 
> Peter

-- 
Peter



More information about the llvm-dev mailing list