[llvm] r211705 - Random Number Generator (llvm)

Tue Aug 19 14:57:53 PDT 2014

Hi all,

Just wanted to ping this thread regarding the latest patch revision:
http://reviews.llvm.org/D4377. This does not include anything fancy
like randomness buffers, which can be added later. However, it does
remove the inverted dependency from Module, and should fit LLVM's
architecture much better now. Any comments? I think we should go ahead
with this revision and add additional features such as crypto-security
or randomness buffers if/when users of the RNG need them.

- stephen

On Mon, Jul 14, 2014 at 1:28 PM, Stephen Crane <sjcrane at uci.edu> wrote:
> I have some concerns about how this would work with large parallel
> builds. For our use case we want something we can drop into an
> existing build system with no modification and use randomness in LLVM
> passes. I don't see a simple way to make this work in a reproducible
> fashion with randomness files or buffers. If we have some complicated
> start/stop/position mechanics, how does this handle data races?
>
> I originally contributed an extremely simple CSPRNG based on the
> already existing support in LLVM for MD5. That would be my ideal
> solution if we want crypto-security. However, I don't think we need a
> crypto-secure RNG, at least for the use-case we have: disrupting
> code-reuse attacks by randomizing code layout. If the attacker can
> read enough of the binary to reverse engineer the RNG, I'm confident
> he would also have read enough code at that point to perform a
> straightforward attack using this code.
>
> - Stephen
>
> On Mon, Jul 14, 2014 at 1:16 PM, Geremy Condra <gcondra at google.com> wrote:
>> On Thu, Jul 10, 2014 at 1:39 PM, Nick Lewycky <nlewycky at google.com> wrote:
>>>
>>> On 10 July 2014 13:02, Geremy Condra <gcondra at google.com> wrote:
>>>>
>>>> On Wed, Jul 9, 2014 at 12:17 AM, Nick Lewycky <nicholas at mxc.ca> wrote:
>>>>>
>>>>> Stephen Crane wrote:
>>>>>>
>>>>>> On Tue, Jul 8, 2014 at 11:26 AM, Geremy Condra<gcondra at google.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> It seems much better to me to use a CSPRNG than to rely on something
>>>>>>> like MT,
>>>>>>> which has significant weaknesses despite its long period. As long as I
>>>>>>> can
>>>>>>> hand it /dev/urandom or an equivalent seed file *and actually use
>>>>>>> that* I
>>>>>>> don't 1000% care, but using a non-cryptographic RNG on top of that is
>>>>>>> very
>>>>>>> smelly.
>>>>>>
>>>>>>
>>>>>> I completely agree with you that a CSPRNG would be best. However, we
>>>>>> got so much pushback from the mailing list that I felt it was better
>>>>>> to start small. Keeping the current interface and adding an optional
>>>>>> better implementation underneath seems like the way to go here.
>>>>>
>>>>>
>>>>> I'm not opposed to a CSPRNG here, but I am concerned. First, I don't
>>>>> see why we should need it, and I'd like the consumers of the random stream to
>>>>> ensure that they aren't relying on any particular strength of the random stream.
>>>>> If they want to do a hash on the RNG output to prevent correlation, the
>>>>> caller should do that. Second, I'm not sure I trust us LLVMers to maintain a
>>>>> cryptographically strong RNG. I don't know that we have the skill set for
>>>>> that.
>>>>
>>>>
>>>> Thus my suggestion to use an external stream of randomness, which
>>>> requires essentially zero cryptographic skill to audit and reduces the
>>>> amount of code to boot.
>>>
>>>
>>> That's a very good point.
>>>
>>>>> If it's critical to have a CSPRNG to make your feature useful then you
>>>>> should argue for it. As it is, the plan is to permit upgrading to a newer
>>>>> RNG by using a different NamedMDNode name which includes the algorithm name.
>>>>>
>>>>>
>>>>>> At least for our use cases, we couldn't use /dev/{u}random directly
>>>>>> because we needed reproducibility. However, the workflow I plan to use
>>>>>> with this is to grab a seed from /dev/random at the beginning of the
>>>>>> build process, note that down somewhere, and use that seed for the
>>>>>> rest of the build. We could certainly do something similar with a
>>>>>> slightly modified RNG impl class which uses a random buffer or
>>>>>> separate process to generate better randomness with a larger seed.
>>>>>>
>>>>>>> It also simplifies the code (since you don't need to add in a new RNG,
>>>>>>> just
>>>>>>> read off of a stream) and makes it more testable (since RNGs are
>>>>>>> notoriously
>>>>>>> easy to get wrong and hard to prove right).
>>>>>>
>>>>>>
>>>>>> Yes, as long as that stream is reproducible somehow. I think we should
>>>>>> preserve the option to recreate all random choices made by LLVM when
>>>>>> bugs crop up or for generating patches.
>>>>>
>>>>>
>>>>> The ability to reproduce the same decisions when debugging the compiler
>>>>> is critical. Even the proposal of re-keying our RNG on a per-pass basis is
>>>>> far from perfect: it allows us to narrow down the passes but not the actual
>>>>> input source code. If we remove a few lines from the middle of a function
>>>>> then the RNG stream will get out of sync and that may mask the bug. Solving
>>>>> that too would be fantastic. :) Realistically I'm relying on random chance
>>>>> to allow us to reduce the code down to a reasonably sized testcase.
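
To make the per-pass re-keying idea concrete, here is a minimal sketch. It is not the code in the patch under review: makePassRNG is a hypothetical helper, and std::mt19937_64 is only a stand-in generator with no cryptographic strength. The assumption is that each pass derives its own generator from a build-wide seed plus its pass name, so that disabling one pass does not shift the stream seen by another.

```cpp
#include <cstdint>
#include <random>
#include <string>
#include <vector>

// Hypothetical helper (not an LLVM API): derive an independent generator for
// one pass from a build-wide seed and the pass name used as a salt.
std::mt19937_64 makePassRNG(uint64_t BuildSeed, const std::string &PassName) {
  std::vector<uint32_t> Data;
  // Split the 64-bit seed into two 32-bit words for std::seed_seq.
  Data.push_back(static_cast<uint32_t>(BuildSeed));
  Data.push_back(static_cast<uint32_t>(BuildSeed >> 32));
  // Mix in the salt, one character per word.
  for (unsigned char C : PassName)
    Data.push_back(C);
  std::seed_seq Seq(Data.begin(), Data.end());
  return std::mt19937_64(Seq);
}
```

The same (seed, pass name) pair always yields the same sequence, which narrows a failure down to a pass. As noted above, it does nothing for edits to the input source, which still change how many values a pass draws.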
>>>>
>>>>
>>>> Thus my suggestion of relying on an external stream of randomness.
>>>> Something as simple as:
>>>>
>>>> dd if=/dev/urandom of=/my/totes/random/data bs=1M count=100
>>>>
>>>> gets you a totally reproducible build.
>>>
>>>
>>> Fine, but how do we turn off the middle pieces of the compiler and still
>>> have reproducible behaviour for the latter parts? The obvious answer is to
>>> have each llvm pass restart from the beginning of the stream, but that has a
>>> new problem in that random choices will be correlated across the compiler.
>>> Can we solve that? Would it be enough to use cs-hash(strong but correlated
>>> random, seeded but non-CS PRNG)?
>>
>>
>> ISTM that any concern about maintaining a CSPRNG would also apply to the
>> other hard-to-verify properties of a PRNG. I'm not sure it's a good idea to
>> try to absorb that challenge, particularly since every fix to it down the
>> road will entail exactly this discussion again.
>>
>> The approach I've advocated for below would solve this by allowing you to
>> stop, persist the last read point, and then restart at an arbitrary time
>> from that point.
>>
>>>
>>>> As an added bonus, maintaining counters for rng bytes consumed during the
>>>> process would allow you to chop up the process simply by
>>>> adding/removing/moving the corresponding bits in the randomness source.
>>>
>>>
>>> That means that tools like bugpoint would have to learn that there is a
>>> random number generator and query its state and manipulate it appropriately.
>>> At a high level I don't see any reason this wouldn't work, but implementing
>>> it is going to be a royal pain. Bugpoint's operations must be represented as
>>> command-line runs of opt. We could add "tell us how many bytes have been
>>> consumed at this point" and "consume X bytes of RNG" passes, but we need one
>>> of each type of pass in order to prevent us from perturbing the pass
>>> structure (or perhaps add it into the PassManager itself?), and we need
>>> some way to pass the "bytes to skip" value into each of these passes
>>> independently, which we don't have today through the command line. This is,
>>> at least, a solvable problem with regular non-cryptographic engineering.
>>> Probably smarter than trying to design our own restartable CSPRNG system.
>>
>>
>> I would probably just have this be part of the randomness stream itself:
>> have it take two files, one of which is the stream and the other one of
>> which is metadata about where to stop/start/etc. Replaying a stream from a
>> checkpoint is then as simple as copying off the metadata file at that point
>> for later use and then providing it again when the next step is to be
>> performed. Conceptually this could also be managed in the code itself by
>> maintaining an awareness of who was draining from the RNG stream, but it
>> seems more complicated for not much gain to me.
>>
>>>
>>>
>>> Have I missed any reason this doesn't work?
>>>
>>> Nick
>>