[LLVMdev] [PATCH][RFC] Add llvm.codegen Intrinsic To Support Embedded LLVM IR Code Generation

Thu May 10 02:25:56 PDT 2012

On 05/09/2012 11:15 PM, Evan Cheng wrote:
>
> On May 9, 2012, at 2:12 AM, Tobias Grosser wrote:
>
>>
>>>> That's why I was asking you where you see the possibility of illegal/malicious code? You did not really explain it yet and I would
>>>> be more than happy to understand such a problem. From my point of view, embedded and host module code are both compiled at the same time and are both checked by the LLVM bitcode verifier. How could this introduce any malicious code that could not be introduced by normal LLVM-IR?
>>>
>>> You're adding a feature that embeds code inside a module. When the module is loaded, is the string going to be verified? How are users of LLVM IR able to ensure the embedded string is safe? I am not saying it cannot be done. This feature just increases the risk, and that again raises the bar for acceptance.
>>
>> What do you mean by verified? How is normal LLVM-IR verified?
>> What do you mean by ensuring an embedded string is safe? How do you ensure normal LLVM-IR is safe?
>>
>> The only existing kind of verification I am aware of is the '-verify' pass that checks an LLVM-IR module. This pass is run over the embedded module at the same time as target code is generated for the host function. If the verification fails, no target code is generated and an empty string is returned. If target code is generated, it is
>> stored back in memory. It can obviously be executed through a function
>> pointer, but this is no different from executing code that is stored in memory through other means.
>>
>> I am kind of surprised security is a concern here. If we really want to do a proper risk analysis, we should first define the security guarantees LLVM gives. I would be surprised if such security guarantees existed. To me, securely verifying LLVM-IR is difficult for reasons other than this intrinsic. Google PNaCl does, for good reasons, not rely on LLVM to provide security guarantees.
>>
>> Still, if this is a concern we could make this intrinsic a target option that is disabled by default.
>
>
> You are missing the point. Don't think in terms of existing implementations. Don't think in terms of clang or other static compilers. There are plenty of systems which use LLVM out there. There can be plenty of different ways to verify / check LLVM IR. We don't know about them.
>
> An LLVM bitcode module as it is today is a representation of some program. It has semantics that are clearly defined by its instructions and data. Now you want to embed some other programs in strings. That makes the IR inherently harder to understand; it's more risky by definition. Of course systems which use LLVM can solve this problem. But it's a big fundamental change and I (and other people on this thread) have pointed out that the benefits are just not worth it.

Hi Evan,

I really cannot follow you. It is already possible to store arbitrary
program code in the data sections of an LLVM bitcode module and to call
a function pointer pointing to this data. The llvm.codegen() intrinsic
does not make this any more dangerous. The only feature it currently
provides is to translate an LLVM-IR string into an assembly string.
This is a data-section-only modification. Calling into such a data
section _may_ introduce security problems, but the very same problems
exist if the string that llvm.codegen() would have generated is
embedded directly into the module. So no, I cannot see how the
intrinsic would break any existing security guarantees.
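
To make the data-only point concrete, here is a toy Python model of the behavior described in this thread. It is only a sketch and not the patch itself: verify, lower_to_asm and codegen_intrinsic are hypothetical stand-ins, and the "verifier" is a trivial placeholder check rather than the real '-verify' pass. It shows nothing more than that the intrinsic maps one string to another (the empty string if verification fails) and never executes anything itself.

```python
def verify(ir_text: str) -> bool:
    """Placeholder for the '-verify' pass (assumption: the real check is far stricter)."""
    stripped = ir_text.strip()
    return stripped.startswith("define ") and stripped.endswith("}")


def lower_to_asm(ir_text: str) -> str:
    """Placeholder for target code generation; returns a fake assembly string."""
    return "; asm generated from %d bytes of IR\n" % len(ir_text)


def codegen_intrinsic(embedded_ir: str) -> str:
    """Toy model of llvm.codegen as described in this thread: string in, string out."""
    if not verify(embedded_ir):
        return ""  # verification failed: no target code is generated
    return lower_to_asm(embedded_ir)  # result is plain data, nothing is executed here
```

Whether the returned string is ever executed is decided by the surrounding program, exactly as for any other bytes stored in the data section.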

Even though I do not share your security concerns, I see that people are
concerned about the readability of the embedded LLVM-IR and believe that
the intrinsic is too project-specific. As I was not able to address
these concerns properly, I won't push this intrinsic any further for now.

Thanks for your comments
Tobi