[LLVMdev] RFC: Machine Level IR text-based serialization format
Hayden Livingston
halivingston at gmail.com
Tue Apr 28 22:35:39 PDT 2015
As an aside, you haven't mentioned this, but will the IR parser be rewritten
at all? Is the YAML a container on top of the IR?
If you are rewriting the IR parser, would it be possible to maintain
some sort of grammar?
On Tue, Apr 28, 2015 at 5:59 PM, David Majnemer
<david.majnemer at gmail.com> wrote:
>
>
> On Tuesday, April 28, 2015, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>
>>
>> On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer <david.majnemer at gmail.com>
>> wrote:
>>>
>>> I love the idea of having some sort of textual representation. My only
>>> concern is that our YAML parser is not very actively maintained (is there
>>> someone expert with its implementation *and* active in the project?) and
>>> (IMHO) over-engineered when compared to the simplicity of our custom IR
>>> parser.
>>>
>>> Without TLC, I'm afraid it would make for a poor piece of LLVM
>>> infrastructure to rely on. The reliability of the serialization mechanism
>>> is very important if we are to have any chance of applying fuzz testing to
>>> the backend pieces; after all, testability is a huge motivation for this
>>> work.
>>>
>>> As a concrete example, a file solely containing '%' crashes the yaml
>>> parser:
>>> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
>>> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool
>>> llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root
>>> is NULL iff parsing failed"' failed.
>>> 0 yaml2obj 0x000000000048682e
>>> 1 yaml2obj 0x0000000000486b43
>>> 2 yaml2obj 0x000000000048570e
>>> 3 libpthread.so.0 0x00007f5e79643340
>>> 4 libc.so.6 0x00007f5e78c9acc9 gsignal + 57
>>> 5 libc.so.6 0x00007f5e78c9e0d8 abort + 328
>>> 6 libc.so.6 0x00007f5e78c93b86
>>> 7 libc.so.6 0x00007f5e78c93c32
>>> 8 yaml2obj 0x000000000045f378
>>> 9 yaml2obj 0x000000000040d4b3
>>> 10 yaml2obj 0x000000000040b0fa
>>> 11 yaml2obj 0x0000000000404a79
>>> 12 yaml2obj 0x0000000000404dd8
>>> 13 libc.so.6 0x00007f5e78c85ec5 __libc_start_main + 245
>>> 14 yaml2obj 0x0000000000404879
>>> Stack dump:
>>> 0. Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff
>>> t.yaml
>>>
>>
>>
>> Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
>> with syntactically invalid or unusual YAML.
>
>
> Maybe. I don't see why we would want to lock ourselves out of using
> afl-fuzz though.
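
A byte-level mutation fuzzer such as afl-fuzz will routinely produce syntactically invalid input, so the reader has to reject it with an error instead of asserting. A minimal Python sketch of that property, under loud assumptions: `parse_mapping` is a toy `key: value` parser standing in for a real YAML reader, `ParseError`, `mutate` and `fuzz` are hypothetical names, and the random single-byte mutation is far simpler than what afl-fuzz actually does (it is coverage-guided).

```python
import random


class ParseError(Exception):
    """The only failure mode a robust parser should expose."""


def parse_mapping(data: bytes) -> dict:
    """Toy 'key: value' parser standing in for a real YAML reader."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        raise ParseError(str(err)) from err
    result = {}
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip():
            continue
        key, sep, value = line.partition(":")
        if not sep or not key.strip():
            raise ParseError(f"line {lineno}: expected 'key: value'")
        result[key.strip()] = value.strip()
    return result


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Overwrite one random byte of a non-empty seed."""
    data = bytearray(seed)
    data[rng.randrange(len(data))] = rng.randrange(256)
    return bytes(data)


def fuzz(seed: bytes, iterations: int = 1000) -> int:
    """Return how many mutated inputs were rejected cleanly."""
    rng = random.Random(0)
    rejected = 0
    for _ in range(iterations):
        try:
            parse_mapping(mutate(seed, rng))
        except ParseError:
            rejected += 1
        # Any other exception propagates and counts as a parser bug.
    return rejected
```

With this toy parser, a file containing only `%` raises `ParseError` rather than aborting, which is the behaviour one would want from the real reader before pointing a fuzzer at it.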
>
>>
>>
>> Also, you're thinking of YAMLIO which is a layer on top of the YAML parser
>> (YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for
>> some types of data, not for all) but still use the YAML parser.
>>
>> -- Sean Silva
>>
>>>
>>> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Thanks for working on this.
>>>>>
>>>>> Personally I would rather not have to write YAML inputs but instead
>>>>> rely on what the machine dumps look like. That being said, I can live
>>>>> with YAML :).
>>>>>
>>>>> More importantly, how do you plan to report syntax errors to the users?
>>>>> Things like invalid instruction, invalid registers, etc.?
>>>>> What about unallocated code, i.e., virtual registers, invalid SSA form,
>>>>> etc.?
>>>>>
>>>>> Cheers,
>>>>> Q.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Unfortunately, the machine dumps are quite incomplete (and tricky to
>>>> parse too!), and thus some sort of new syntax has to be developed.
>>>> I think that a YAML-based container is a good candidate for this
>>>> purpose, as it has a structured format that represents things like
>>>> machine functions, frame information, register information,
>>>> target-specific machine function details, etc. in a clear and readable
>>>> way.
>>>>
>>>> I haven't thought about error reporting that much, as I've been mostly
>>>> working on developing the syntax and making sure that all the data
>>>> structures can be represented by it. But I believe that the errors that
>>>> crop up in an invalid machine instruction syntax, like invalid basic
>>>> block references, invalid instructions, etc., can be reported quite
>>>> well, and I can rely on already existing error reporting facilities in
>>>> LLVM to help me. The more structural errors, like missing attributes,
>>>> will be handled by the YAML parser automatically, and I might extend it
>>>> to provide better/more specific error messages. And I think that it's
>>>> possible to use the machine verifier to catch the other errors that
>>>> you've mentioned.
>>>>
>>>> Alex
>>>>
>>>>
>>>>>
>>>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I would like to propose a text-based, human readable format that will
>>>>> be used to serialize the machine level IR. The major goal of this
>>>>> format is to allow LLVM to save the machine level IR after any code
>>>>> generation pass and then to load it again and continue running passes
>>>>> on the machine level IR. The primary use case of this format is to
>>>>> enable an easier testing process for the code generation passes, by
>>>>> allowing developers to write tests that load the IR, then invoke just
>>>>> a specific code gen pass and then inspect the output of that pass by
>>>>> checking the printed out IR.
>>>>>
>>>>>
>>>>>
>>>>> The proposed format has a number of key features:
>>>>>
>>>>> - It stores the machine level IR and the optional LLVM IR in one text
>>>>>   file.
>>>>> - The connections between the machine level IR and the LLVM IR are
>>>>>   preserved.
>>>>> - The format uses a YAML-based container for most of the data
>>>>>   structures. The LLVM IR is embedded in the YAML container.
>>>>> - The format also uses a new, text-based syntax to serialize the
>>>>>   machine instructions. The instructions are embedded in YAML.
>>>>>
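
For the instruction strings embedded in YAML, here is a minimal Python sketch of how such a string might be split into defined registers, opcode and operands. The shape `<defs> = <opcode> <operand>, ...` is an assumption inferred from strings like `'%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'`; the proposal does not specify the grammar, and `MachineInstr` and `parse_instruction` are hypothetical names, not LLVM APIs.

```python
from typing import NamedTuple


class MachineInstr(NamedTuple):
    defs: list      # registers written, e.g. ['%edi']
    opcode: str     # e.g. 'mov32rm'
    operands: list  # remaining operands, flags such as 'undef' kept as written


def parse_instruction(text: str) -> MachineInstr:
    """Split '<defs> = <opcode> <operands>' (assumed shape) into its parts."""
    defs = []
    rest = text.strip()
    if "=" in rest:
        # Everything left of the first '=' is treated as the defined registers.
        lhs, rest = rest.split("=", 1)
        defs = [d.strip() for d in lhs.split(",") if d.strip()]
    parts = rest.strip().split(None, 1)
    if not parts:
        raise ValueError(f"missing opcode in {text!r}")
    opcode = parts[0]
    operands = []
    if len(parts) == 2:
        operands = [o.strip() for o in parts[1].split(",")]
    return MachineInstr(defs, opcode, operands)


load = parse_instruction("%edi = mov32rm %rsp, 1, %noreg, 4, %noreg")
push = parse_instruction("push64r undef %rax, %rsp, %rsp")
```

This only tokenizes; checking that the opcode and registers exist for the target would still need the target description, and the sketch makes no attempt at that.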
>>>>>
>>>>> This is an incomplete example of a YAML file containing the LLVM IR,
>>>>> the machine level IR and the instructions:
>>>>>
>>>>>
>>>>> ---
>>>>> ir: |
>>>>>   define i32 @fact(i32 %n) {
>>>>>     %1 = alloca i32, align 4
>>>>>     store i32 %n, i32* %1, align 4
>>>>>     %2 = load i32, i32* %1, align 4
>>>>>     %3 = icmp eq i32 %2, 0
>>>>>     br i1 %3, label %10, label %4
>>>>>
>>>>>   ; <label>:4 ; preds = %0
>>>>>     %5 = load i32, i32* %1, align 4
>>>>>     %6 = sub nsw i32 %5, 1
>>>>>     %7 = call i32 @fact(i32 %6)
>>>>>     %8 = load i32, i32* %1, align 4
>>>>>     %9 = mul nsw i32 %7, %8
>>>>>     br label %10
>>>>>
>>>>>   ; <label>:10 ; preds = %0, %4
>>>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
>>>>>     ret i32 %11
>>>>>   }
>>>>>
>>>>> ...
>>>>> ---
>>>>> number: 0
>>>>> name: fact
>>>>> alignment: 4
>>>>> regInfo:
>>>>>   ....
>>>>> frameInfo:
>>>>>   ....
>>>>> body:
>>>>>   - bb: 0
>>>>>     llbb: '%0'
>>>>>     successors: [ 'bb#2', 'bb#1' ]
>>>>>     liveIns: [ '%edi' ]
>>>>>     instructions:
>>>>>       - 'push64r undef %rax, %rsp, %rsp'
>>>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
>>>>>       - ....
>>>>>     ....
>>>>>   - bb: 1
>>>>>     llbb: '%4'
>>>>>     successors: [ 'bb#2' ]
>>>>>     instructions:
>>>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
>>>>>       - ....
>>>>>     ....
>>>>>   - ....
>>>>> ....
>>>>> ...
>>>>>
>>>>>
>>>>> The example above shows a YAML file with two YAML documents (delimited
>>>>> by `---` and `...`) containing the LLVM IR and the machine function
>>>>> information for the function `fact`.
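
The two-document layout can be sketched in Python with the standard library only. This is a simplified splitter, not a YAML parser: it assumes `---` and `...` appear alone on a line at column 0 and ignores directives and content following `---` on the same line. `split_documents` is a hypothetical helper, and the IR body in `example` is a shortened stand-in rather than the function from the proposal.

```python
def split_documents(stream: str) -> list:
    """Return the text of each document delimited by '---' and '...'."""
    documents, current = [], None
    for line in stream.splitlines():
        marker = line.rstrip()
        if marker == "---":
            if current is not None:   # '---' also ends a preceding document
                documents.append("\n".join(current))
            current = []
        elif marker == "...":
            if current is not None:
                documents.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:           # stream ended without a closing '...'
        documents.append("\n".join(current))
    return documents


example = """---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 %n
  }
...
---
number: 0
name: fact
...
"""

# First document: the embedded LLVM IR; second: the machine function.
ir_doc, mf_doc = split_documents(example)
```

Because the IR sits in a block scalar (`ir: |`), its lines are indented and cannot be confused with the column-0 document markers.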
>>>>>
>>>>>
>>>>>
>>>>> When a specific format is chosen, I'll start with patches that
>>>>> serialize the embedded LLVM IR. Then I'll add support for things like
>>>>> machine functions and machine basic blocks, and I think that an
>>>>> intrusive implementation will work best for data structures like
>>>>> these. After that I will continue adding support for serialization of
>>>>> the remaining data structures.
>>>>>
>>>>>
>>>>>
>>>>> Thanks for reading through the proposal. What are your thoughts about
>>>>> this format?
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>