[LLVMdev] RFC: Machine Level IR text-based serialization format

David Majnemer david.majnemer at gmail.com
Tue Apr 28 17:59:08 PDT 2015


On Tuesday, April 28, 2015, Sean Silva <chisophugis at gmail.com> wrote:

>
>
> On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer <david.majnemer at gmail.com> wrote:
>
>> I love the idea of having some sort of textual representation.  My only
>> concern is that our YAML parser is not very actively maintained (is there
>> someone expert with its implementation *and* active in the project?) and
>> (IMHO) over-engineered when compared to the simplicity of our custom IR
>> parser.
>>
>> Without TLC, I'm afraid it would make for a poor piece of LLVM
>> infrastructure to rely on.  The reliability of the serialization mechanism
>> is very important if we are to have any chance of applying fuzz testing to
>> the backend pieces; after all, testability is a huge motivation for this
>> work.
>>
>> As a concrete example, a file solely containing '%' crashes the yaml
>> parser:
>> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
>> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool
>> llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root
>> is NULL iff parsing failed"' failed.
>> 0  yaml2obj        0x000000000048682e
>> 1  yaml2obj        0x0000000000486b43
>> 2  yaml2obj        0x000000000048570e
>> 3  libpthread.so.0 0x00007f5e79643340
>> 4  libc.so.6       0x00007f5e78c9acc9 gsignal + 57
>> 5  libc.so.6       0x00007f5e78c9e0d8 abort + 328
>> 6  libc.so.6       0x00007f5e78c93b86
>> 7  libc.so.6       0x00007f5e78c93c32
>> 8  yaml2obj        0x000000000045f378
>> 9  yaml2obj        0x000000000040d4b3
>> 10 yaml2obj        0x000000000040b0fa
>> 11 yaml2obj        0x0000000000404a79
>> 12 yaml2obj        0x0000000000404dd8
>> 13 libc.so.6       0x00007f5e78c85ec5 __libc_start_main + 245
>> 14 yaml2obj        0x0000000000404879
>> Stack dump:
>> 0.      Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff
>> t.yaml
>>
>>
>
> Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
> with syntactically invalid or unusual YAML.
>

Maybe.  I don't see why we would want to lock ourselves out of using
afl-fuzz though.
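
To make the concern concrete, here is a minimal sketch of a byte-level mutation loop. It is not afl-fuzz (which is coverage-guided), and `toy_parse`, `mutate` and `fuzz` are hypothetical names invented for illustration; the point is only that a mutator with no knowledge of the syntax produces mostly malformed input, so the parser has to reject it with a diagnostic instead of asserting.

```python
import random


def toy_parse(text):
    """Hypothetical stand-in for a real parser: accepts 'key: value' lines only.

    Malformed input is rejected with ValueError rather than an assertion.
    """
    result = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep or not key.strip().isidentifier():
            raise ValueError(f"line {line_no}: expected 'key: value'")
        result[key.strip()] = value.strip()
    return result


def mutate(data, rng):
    """Overwrite one random byte; the mutation knows nothing about syntax."""
    buf = bytearray(data)
    buf[rng.randrange(len(buf))] = rng.randrange(256)
    return bytes(buf)


def fuzz(seed_input, iterations, rng):
    """Feed mutated inputs to the parser and classify each outcome."""
    stats = {"accepted": 0, "rejected": 0, "crashed": 0}
    for _ in range(iterations):
        text = mutate(seed_input, rng).decode("latin-1")
        try:
            toy_parse(text)
            stats["accepted"] += 1
        except ValueError:
            stats["rejected"] += 1  # clean diagnostic: the desired behaviour
        except Exception:
            stats["crashed"] += 1   # anything else would be a parser bug
    return stats


stats = fuzz(b"number: 0\nname: fact\nalignment: 4\n", 1000, random.Random(0))
```

A parser that is safe to fuzz this way keeps `stats["crashed"]` at zero; the '%' input from my earlier mail would land in the "rejected" bucket here rather than hitting an assertion.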


>
> Also, you're thinking of YAMLIO which is a layer on top of the YAML parser
> (YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for
> some types of data, not for all) but still use the YAML parser.
>
> -- Sean Silva
>
>
>> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
>>
>>>
>>>
>>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
>>>
>>>> Hi Alex,
>>>>
>>>> Thanks for working on this.
>>>>
>>>> Personally I would rather not have to write YAML inputs but instead
>>>> rely on what the machine dumps look like. That being said, I can live
>>>> with YAML :).
>>>>
>>>> More importantly, how do you plan to report syntax errors to the users?
>>>> Things like invalid instruction, invalid registers, etc.?
>>>> What about unallocated code, i.e., virtual registers, invalid SSA form,
>>>> etc.?
>>>>
>>>> Cheers,
>>>> Q.
>>>>
>>>
>>> Thanks,
>>>
>>> Unfortunately, the machine dumps are quite incomplete (and tricky to
>>> parse too!), and thus some sort of new syntax has to be developed.
>>> I think that a YAML based container is a good candidate for this
>>> purpose, as it has a structured format that represents things like
>>> machine functions, frame information, register information, target
>>> specific machine function details, etc. in a clear and readable way.
>>>
>>> I haven't thought about error reporting that much, as I've been mostly
>>> working on developing the syntax and making sure that all the data
>>> structures can be represented by it. But I believe that the errors that
>>> crop up in an invalid machine instruction syntax, like invalid basic
>>> block references, invalid instructions, etc., can be reported quite well,
>>> and I can rely on already existing error reporting facilities in LLVM to
>>> help me. The more structural errors, like missing attributes, will be
>>> handled by the YAML parser automatically, and I might extend it to
>>> provide better/more specific error messages. And I think that it's
>>> possible to use the machine verifier to catch the other errors that
>>> you've mentioned.
>>>
>>> Alex
>>>
>>>
>>>
>>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> I would like to propose a text-based, human readable format that will be used to
>>>> serialize the machine level IR. The major goal of this format is to allow LLVM
>>>> to save the machine level IR after any code generation pass and then to load
>>>> it again and continue running passes on the machine level IR. The primary use case
>>>> of this format is to make testing the code generation passes easier,
>>>> by allowing developers to write tests that load the IR, then invoke just a
>>>> specific code gen pass and then inspect the output of that pass by checking the
>>>> printed out IR.
>>>>
>>>>
>>>>
>>>> The proposed format has a number of key features:
>>>>
>>>> - It stores the machine level IR and the optional LLVM IR in one text file.
>>>> - The connections between the machine level IR and the LLVM IR are preserved.
>>>> - The format uses a YAML based container for most of the data structures. The LLVM
>>>>   IR is embedded in the YAML container.
>>>> - The format also uses a new, text-based syntax to serialize the machine instructions.
>>>>   The instructions are embedded in YAML.
>>>>
>>>>
>>>> This is an incomplete example of a YAML file containing the LLVM IR, the machine level IR
>>>> and the instructions:
>>>>
>>>>
>>>> ---
>>>> ir: |
>>>>   define i32 @fact(i32 %n) {
>>>>     %1 = alloca i32, align 4
>>>>     store i32 %n, i32* %1, align 4
>>>>     %2 = load i32, i32* %1, align 4
>>>>     %3 = icmp eq i32 %2, 0
>>>>     br i1 %3, label %10, label %4
>>>>
>>>>   ; <label>:4                                       ; preds = %0
>>>>     %5 = load i32, i32* %1, align 4
>>>>     %6 = sub nsw i32 %5, 1
>>>>     %7 = call i32 @fact(i32 %6)
>>>>     %8 = load i32, i32* %1, align 4
>>>>     %9 = mul nsw i32 %7, %8
>>>>     br label %10
>>>>
>>>>   ; <label>:10                                      ; preds = %0, %4
>>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
>>>>     ret i32 %11
>>>>   }
>>>>
>>>> ...
>>>> ---
>>>> number:          0
>>>> name:            fact
>>>> alignment:       4
>>>> regInfo:
>>>>   ....
>>>> frameInfo:
>>>>   ....
>>>> body:
>>>>   - bb:              0
>>>>     llbb:            '%0'
>>>>     successors:      [ 'bb#2', 'bb#1' ]
>>>>     liveIns:         [ '%edi' ]
>>>>     instructions:
>>>>       - 'push64r undef %rax, %rsp, %rsp'
>>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
>>>>       - ....
>>>>         ....
>>>>   - bb:              1
>>>>     llbb:            '%4'
>>>>     successors:      [ 'bb#2' ]
>>>>     instructions:
>>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
>>>>       - ....
>>>>         ....
>>>>   - ....
>>>>     ....
>>>> ...
>>>>
>>>>
>>>> The example above shows a YAML file with two YAML documents (delimited by `---`
>>>> and `...`) containing the LLVM IR and the machine function information for the function `fact`.
>>>>
>>>>
>>>>
>>>> When a specific format is chosen, I'll start with patches that serialize the
>>>> embedded LLVM IR. Then I'll add support for things like machine functions and
>>>> machine basic blocks, and I think that an intrusive implementation will work best
>>>> for data structures like these. After that I will continue adding support for
>>>> serialization of the remaining data structures.
>>>>
>>>>
>>>>
>>>> Thanks for reading through the proposal. What are your thoughts about this format?
>>>>
>>>>
>>>> _______________________________________________
>>>> LLVM Developers mailing list
>>>> LLVMdev at cs.uiuc.edu
>>>> http://llvm.cs.uiuc.edu
>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>
>>>>
>>>>
>>>
>>
>
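
For reference, the two-document layout described in the quoted proposal (one document for the LLVM IR, one for the machine function, delimited by `---` and `...`) can be sketched roughly as below. This is only a line-based approximation under stated assumptions: `split_documents` is a hypothetical helper, it recognises bare marker lines only, and an actual implementation would presumably go through LLVM's YAML parser. The IR in the sample input is abbreviated from the proposal's example.

```python
def split_documents(stream):
    """Split text into YAML documents using '---' and '...' marker lines.

    Simplified on purpose: only unindented, bare marker lines are
    recognised; directives and content after a marker are not handled.
    """
    docs, current, in_doc = [], [], False
    for line in stream.splitlines():
        marker = line.rstrip()
        if marker == "---":
            # A new start marker also closes any document still open.
            if in_doc:
                docs.append("\n".join(current))
            current, in_doc = [], True
        elif marker == "...":
            if in_doc:
                docs.append("\n".join(current))
            current, in_doc = [], False
        elif in_doc:
            current.append(line)
    if in_doc:
        # Tolerate a final document without an explicit end marker.
        docs.append("\n".join(current))
    return docs


# Abbreviated input modelled on the proposal's example (function body shortened).
example = """\
---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 1
  }
...
---
number:          0
name:            fact
alignment:       4
...
"""

ir_doc, mf_doc = split_documents(example)
```

Note that the indented `....` placeholders in the proposal's example would not be mistaken for end markers here, because only an unindented line consisting of exactly three dots is treated as one.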