[llvm-dev] Dynamic VMA in Sanitizers for AArch64

Kristof Beyls via llvm-dev llvm-dev at lists.llvm.org
Fri Sep 25 01:30:21 PDT 2015


Thanks for writing this up Renato.

What you describe below has been the option I've preferred for a while,
so it looks like a good approach to me.

I just wanted to note that on AArch64, having the shadow offset in a
register rather than as an immediate could result in faster execution,
not slower, since the shadow address can then be computed in a single
instruction rather than two. Assuming
x(SHADOW_OFFSET) is the register containing the shadow offset:

        add     x8, x(SHADOW_OFFSET), x0, lsr #3

instead of

        lsr     x8, x0, #3
        orr     x8, x8, #0x1000000000
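To make the comparison concrete, here is a minimal C sketch of the two
shadow-computation forms; the scale and offset values below are
illustrative (typical of a 39-bit VMA setup), not the exact constants
the runtime uses:

```c
#include <stdint.h>

/* Illustrative ASan-style shadow parameters (hypothetical values). */
#define SHADOW_SCALE  3
#define SHADOW_OFFSET 0x1000000000ULL

/* Constant-offset form: shift, then OR in the immediate offset.
   On AArch64 this takes two instructions (lsr + orr). */
static uint64_t shadow_static(uint64_t addr) {
    return (addr >> SHADOW_SCALE) | SHADOW_OFFSET;
}

/* Register-offset form: add with a shifted register operand.
   On AArch64 this is a single instruction (add xD, xOFF, xA, lsr #3). */
static uint64_t shadow_dynamic(uint64_t addr, uint64_t offset_reg) {
    return offset_reg + (addr >> SHADOW_SCALE);
}
```

The two forms agree as long as the shifted address bits never overlap
the offset bits, which is how the shadow region is laid out.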

But as you say, the overall performance effect of dynamic VMA support
will need to be measured.

Thanks,

Kristof

> -----Original Message-----
> From: Renato Golin [mailto:renato.golin at linaro.org]
> Sent: 25 September 2015 09:20
> To: Kostya Serebryany; Evgenii Stepanov; Kristof Beyls; James Molloy;
> Adhemerval Zanella; Saleem Abdulrasool; Christophe Lyon
> Cc: Jakub Jelinek; Ramana Radhakrishnan; Will Deacon; LLVM Dev
> Subject: Dynamic VMA in Sanitizers for AArch64
> 
> Hi folks,
> 
> After long talks with lots of people, I think we have a winning strategy
> to deal with the variable nature of VMA address in AArch64.
> It seems that the best way forward is to try the dynamic calculation at
> runtime, evaluate the performance, and then, only if the hit is too
> great, think about compile-time alternatives. I'd like to know if
> everyone is in agreement, so we could get cracking.
> 
> 
>   The Issues
> 
> If you're not familiar with the problem, here's a quick run down...
> 
> On most systems, the VMA address (and thus shadow mask and shift
> value) is a constant. This produces very efficient code, as the shadow
> address computation becomes an immediate shift plus a constant mask.
> But AArch64 is different.
> 
> In order to execute 32-bit code, the kernel has to use 4k pages, which
> is currently configured with either a 39- or 48-bit VMA. For 64-bit-only
> systems, 64k pages are used, with either a 42- or 48-bit VMA.
> In theory, the kernel could use even more bits and different page sizes;
> systems are free to choose, and different values are already in use.
> 
> This means that the VMA value can change depending on the kernel, and
> cross-compilation for testing on multiple systems will not work unless
> the true value is computed at runtime. But it also means that the value
> has to be stored in a global, which requires an additional load and
> register-based shift per instrumented access, slowing down execution
> even further.
> 
> 
>   The Current Status
> 
> Right now, in order to test it, we made it into a compiler-build option.
> As you build Clang/LLVM, you can use a CMake option to set the VMA, with
> 39 being the default. We have 39 and 42 buildbots to make sure all works
> well, but that's clearly the wrong solution for anything other than
> enablement.
> 
> With all the sanitizers going in for AArch64, we can now focus on making
> a good implementation for the VMA issue, in a way that benefits both
> LLVM and GCC, since they have different usages (static vs dynamic
> linkage).
> 
> With the build-time option making the value static, we have the best
> performance we could ever get. This means any further change will
> impact performance, but the change is necessary, so we just need to
> take the lowest-cost, highest-benefit option.
> 
> 
>   The Options
> 
> The two options we have are:
> 
> 1. Dynamic VMA: instrument main() to read the VMA value and set a
> global. Instrument each function to load that global into a local
> register, and each load/store/malloc/free to compute its shadow check
> based on that register. The compiler may optimise this for
> compiler-instrumented code, but not for the library calls.
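As a rough illustration of what option 1 amounts to, here is a C sketch
of detecting the VMA size at startup and publishing the shadow offset in
a global that instrumented code then loads; the names and the detection
trick (highest set bit of a stack address) are hypothetical, not the
actual runtime code:

```c
#include <stdint.h>

/* Global shadow offset, set once at startup (option 1's "global value").
   Name is hypothetical. */
static uint64_t g_shadow_offset;

/* Index of the most significant set bit, or -1 for zero. */
static int most_significant_set_bit(uint64_t x) {
    int n = -1;
    while (x) { x >>= 1; n++; }
    return n;
}

/* A local variable's address lies near the top of the VMA range, so its
   highest set bit (plus one) gives the number of VMA bits in use. */
static void init_shadow(void) {
    int dummy;
    uint64_t frame = (uint64_t)(uintptr_t)&dummy;
    int vma_bits = most_significant_set_bit(frame) + 1;
    /* Hypothetical mapping from VMA bits to shadow offset; the real
       runtime would use its own formula. */
    g_shadow_offset = 1ULL << (vma_bits - 3);
}

/* Each instrumented access loads the global instead of an immediate. */
static uint64_t mem_to_shadow(uint64_t addr) {
    return g_shadow_offset + (addr >> 3);
}
```

The per-access cost is the extra load of g_shadow_offset (or keeping it
pinned in a register per function), which is exactly the overhead the
measurements would need to quantify.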
> 
> 2. Add a compiler option -mvma=NN that chooses the VMA at compile time
> and makes it static in the user code. This gives the same performance
> as today for compiler-instrumented code, but not for library calls,
> especially with dynamic linkage. It is faster, but also less flexible
> than option 1, though more flexible than the current implementation.
> 
> 
>   The Plan
> 
> Right now, we're planning to implement the full dynamic VMA and
> investigate the performance impact. If it is within an acceptable
> range, we just go along with it and look at the compile-time flag
> later, as a further optimisation.
> 
> If the impact is too great, we may want to profile and implement -mvma
> straight after the dynamic VMA checks. In that case, we should keep
> *both* implementations, so that users can choose what suits them best.
> 
> Either way, I'd like to get the opinion of everybody to make sure I'm
> not forgetting anything before we start cracking the problem into an
> acceptable solution.
> 
> cheers,
> --renato
