[LLVMdev] RFC: Machine Level IR text-based serialization format

Tue Apr 28 22:35:39 PDT 2015

As an aside, you haven't mentioned this, but will the IR parser be
rewritten at all? Is the YAML just a container around the IR?

If you are rewriting the IR parser, would it be possible to maintain
some sort of grammar?

On Tue, Apr 28, 2015 at 5:59 PM, David Majnemer
<david.majnemer at gmail.com> wrote:
>
>
> On Tuesday, April 28, 2015, Sean Silva <chisophugis at gmail.com> wrote:
>>
>>
>>
>> On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer <david.majnemer at gmail.com>
>> wrote:
>>>
>>> I love the idea of having some sort of textual representation.  My only
>>> concern is that our YAML parser is not very actively maintained (is there
>>> someone expert with its implementation *and* active in the project?) and
>>> (IMHO) over-engineered when compared to the simplicity of our custom IR
>>> parser.
>>>
>>> Without TLC, I'm afraid it would make for a poor piece of LLVM
>>> infrastructure to rely on.  The reliability of the serialization mechanism
>>> is very important if we are to have any chance of applying fuzz testing to
>>> the backend pieces; after all, testability is a huge motivation for this
>>> work.
>>>
>>> As a concrete example, a file solely containing '%' crashes the yaml
>>> parser:
>>> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
>>> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool
>>> llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root
>>> is NULL iff parsing failed"' failed.
>>> 0  yaml2obj        0x000000000048682e
>>> 1  yaml2obj        0x0000000000486b43
>>> 2  yaml2obj        0x000000000048570e
>>> 3  libpthread.so.0 0x00007f5e79643340
>>> 4  libc.so.6       0x00007f5e78c9acc9 gsignal + 57
>>> 5  libc.so.6       0x00007f5e78c9e0d8 abort + 328
>>> 6  libc.so.6       0x00007f5e78c93b86
>>> 7  libc.so.6       0x00007f5e78c93c32
>>> 8  yaml2obj        0x000000000045f378
>>> 9  yaml2obj        0x000000000040d4b3
>>> 10 yaml2obj        0x000000000040b0fa
>>> 11 yaml2obj        0x0000000000404a79
>>> 12 yaml2obj        0x0000000000404dd8
>>> 13 libc.so.6       0x00007f5e78c85ec5 __libc_start_main + 245
>>> 14 yaml2obj        0x0000000000404879
>>> Stack dump:
>>> 0.      Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff
>>> t.yaml
>>>
>>
>>
>> Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
>> with syntactically invalid or unusual YAML.
>
>
> Maybe.  I don't see why we would want to lock ourselves out of using
> afl-fuzz though.
>
>>
>>
>> Also, you're thinking of YAMLIO which is a layer on top of the YAML parser
>> (YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for
>> some types of data, not for all) but still use the YAML parser.
>>
>> -- Sean Silva
>>
>>>
>>> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
>>>>>
>>>>> Hi Alex,
>>>>>
>>>>> Thanks for working on this.
>>>>>
>>>>> Personally I would rather not have to write YAML inputs but instead
>>>>> rely on what the machine dumps look like. That being said, I can live
>>>>> with YAML :).
>>>>>
>>>>> More importantly, how do you plan to report syntax errors to the users?
>>>>> Things like invalid instruction, invalid registers, etc.?
>>>>> What about unallocated code, i.e., virtual registers, invalid SSA form,
>>>>> etc.?
>>>>>
>>>>> Cheers,
>>>>> Q.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Unfortunately, the machine dumps are quite incomplete (and tricky to
>>>> parse too!), and thus some sort of new syntax has to be developed.
>>>> I think that a YAML based container is a good candidate for this
>>>> purpose, as it has a structured format that represents things like
>>>> machine functions, frame information, register information, target
>>>> specific machine function details, etc. in a clear and readable way.
>>>>
>>>> I haven't thought about error reporting that much, as I've been mostly
>>>> working on developing the syntax and making sure that all the data
>>>> structures can be represented by it. But I believe that the errors that
>>>> crop up in an invalid machine instruction syntax, like invalid basic
>>>> block references, invalid instructions, etc., can be reported quite well,
>>>> and I can rely on already existing error reporting facilities in LLVM to
>>>> help me. The more structural errors, like missing attributes, will be
>>>> handled by the YAML parser automatically, and I might extend it to
>>>> provide better/more specific error messages. And I think that it's
>>>> possible to use the machine verifier to catch the other errors that
>>>> you've mentioned.
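
To make the "invalid basic block references" case concrete, here is a
minimal sketch of the kind of check that could produce such an error. It
is only an illustration: the function name check_successors, the
dictionary layout, and the message wording are assumptions of mine, not
part of the proposal or of any LLVM API. Only the 'bb#N' reference
spelling and the bb/successors keys are taken from the example in the
proposal.

```python
import re

# 'bb#N' is the reference spelling used in the proposal's example.
BLOCK_REF = re.compile(r"^bb#(\d+)$")


def check_successors(blocks):
    """Return error strings for successor references that do not resolve.

    `blocks` is a hypothetical, already-loaded form of the `body:` list:
    one dict per machine basic block with a `bb` number and an optional
    `successors` list of 'bb#N' strings.
    """
    known = {block["bb"] for block in blocks}
    errors = []
    for block in blocks:
        for ref in block.get("successors", []):
            match = BLOCK_REF.match(ref)
            if match is None:
                errors.append(
                    f"bb#{block['bb']}: malformed block reference '{ref}'")
            elif int(match.group(1)) not in known:
                errors.append(
                    f"bb#{block['bb']}: reference to undefined block '{ref}'")
    return errors


# The two blocks spelled out in the proposal's (incomplete) example.
# Block 2 is elided there, so both references to it are reported here.
blocks = [
    {"bb": 0, "successors": ["bb#2", "bb#1"]},
    {"bb": 1, "successors": ["bb#2"]},
]
errors = check_successors(blocks)
for message in errors:
    print(message)
```

Collecting all errors instead of stopping at the first one is a design
choice for this sketch; a real implementation would presumably also attach
source locations through LLVM's existing diagnostics.
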
>>>>
>>>> Alex
>>>>
>>>>
>>>>>
>>>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>>
>>>>> I would like to propose a text-based, human readable format that will
>>>>> be used to serialize the machine level IR. The major goal of this format
>>>>> is to allow LLVM to save the machine level IR after any code generation
>>>>> pass and then to load it again and continue running passes on the machine
>>>>> level IR. The primary use case of this format is to enable an easier
>>>>> testing process for the code generation passes, by allowing the developers
>>>>> to write tests that load the IR, then invoke just a specific code gen pass
>>>>> and then inspect the output of that pass by checking the printed out IR.
>>>>>
>>>>>
>>>>>
>>>>> The proposed format has a number of key features:
>>>>>
>>>>> - It stores the machine level IR and the optional LLVM IR in one text
>>>>>   file.
>>>>> - The connections between the machine level IR and the LLVM IR are
>>>>>   preserved.
>>>>> - The format uses a YAML based container for most of the data
>>>>>   structures. The LLVM IR is embedded in the YAML container.
>>>>> - The format also uses a new, text-based syntax to serialize the
>>>>>   machine instructions. The instructions are embedded in YAML.
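
As a rough illustration of the last point, the instruction strings used
later in this message (for example '%edi = mov32rm %rsp, 1, %noreg, 4,
%noreg') could be split into defined registers, an opcode and operands
along these lines. The grammar assumed here ("defs = opcode op, op, ...")
is my own reading of those sample strings, and parse_instruction is a
hypothetical helper, not something the proposal specifies.

```python
def parse_instruction(text):
    """Split one instruction string into defs, opcode and operands.

    Assumed shape (inferred from the samples, not a specification):
        [def, def, ... = ] opcode [operand, operand, ...]
    Operand flags such as 'undef' are left attached to their operand.
    """
    defs = []
    rest = text.strip()
    if " = " in rest:
        lhs, rest = rest.split(" = ", 1)
        defs = [d.strip() for d in lhs.split(",")]
    opcode, _, operand_text = rest.partition(" ")
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    return {"defs": defs, "opcode": opcode, "operands": operands}


# Instruction strings copied from the proposal's example.
load = parse_instruction("%edi = mov32rm %rsp, 1, %noreg, 4, %noreg")
push = parse_instruction("push64r undef %rax, %rsp, %rsp")
print(load)
print(push)
```

A real parser would need a proper lexer (register names, immediates,
flags, block references), which is where a maintained grammar would help.
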
>>>>>
>>>>>
>>>>> This is an incomplete example of a YAML file containing the LLVM IR,
>>>>> the machine level IR and the instructions:
>>>>>
>>>>>
>>>>> ---
>>>>> ir: |
>>>>>   define i32 @fact(i32 %n) {
>>>>>     %1 = alloca i32, align 4
>>>>>     store i32 %n, i32* %1, align 4
>>>>>     %2 = load i32, i32* %1, align 4
>>>>>     %3 = icmp eq i32 %2, 0
>>>>>     br i1 %3, label %10, label %4
>>>>>
>>>>>   ; <label>:4                                       ; preds = %0
>>>>>     %5 = load i32, i32* %1, align 4
>>>>>     %6 = sub nsw i32 %5, 1
>>>>>     %7 = call i32 @fact(i32 %6)
>>>>>     %8 = load i32, i32* %1, align 4
>>>>>     %9 = mul nsw i32 %7, %8
>>>>>     br label %10
>>>>>
>>>>>   ; <label>:10                                      ; preds = %0, %4
>>>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
>>>>>     ret i32 %11
>>>>>   }
>>>>>
>>>>> ...
>>>>> ---
>>>>> number:          0
>>>>> name:            fact
>>>>> alignment:       4
>>>>> regInfo:
>>>>>   ....
>>>>> frameInfo:
>>>>>   ....
>>>>> body:
>>>>>   - bb:              0
>>>>>     llbb:            '%0'
>>>>>     successors:      [ 'bb#2', 'bb#1' ]
>>>>>     liveIns:         [ '%edi' ]
>>>>>     instructions:
>>>>>       - 'push64r undef %rax, %rsp, %rsp'
>>>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
>>>>>       - ....
>>>>>         ....
>>>>>   - bb:              1
>>>>>     llbb:            '%4'
>>>>>     successors:      [ 'bb#2' ]
>>>>>     instructions:
>>>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
>>>>>       - ....
>>>>>         ....
>>>>>   - ....
>>>>>     ....
>>>>> ...
>>>>>
>>>>>
>>>>> The example above shows a YAML file with two YAML documents (delimited
>>>>> by `---` and `...`) containing the LLVM IR and the machine function
>>>>> information for the function `fact`.
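
For readers less familiar with multi-document YAML, the `---` / `...`
delimiting described here can be sketched with a few lines of standard
library Python. This is deliberately not a YAML parser: it only treats
lines consisting solely of a marker as boundaries, and the function name
split_yaml_documents and the abbreviated sample (with a shortened function
body) are my own, for illustration only.

```python
def split_yaml_documents(text):
    """Split a YAML stream into raw document bodies.

    Simplification: a line that is exactly '---' starts a document and a
    line that is exactly '...' ends one. Indented text such as the '....'
    placeholders inside a block is therefore never mistaken for a marker.
    """
    documents = []
    current = None  # None while we are outside any document
    for line in text.splitlines():
        marker = line.rstrip()
        if marker == "---":
            if current is not None:
                documents.append("\n".join(current))
            current = []
        elif marker == "...":
            if current is not None:
                documents.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        documents.append("\n".join(current))
    return documents


# Abbreviated stand-in for the file above: one IR document followed by
# one machine function document.
SAMPLE = """\
---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 %n
  }
...
---
number:          0
name:            fact
alignment:       4
...
"""

docs = split_yaml_documents(SAMPLE)
print(len(docs))
```

Each returned string would then be handed to whatever parses that kind of
document: the first to the LLVM IR side, the second to the machine
function side.
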
>>>>>
>>>>>
>>>>>
>>>>> When a specific format is chosen, I'll start with patches that
>>>>> serialize the embedded LLVM IR. Then I'll add support for things like
>>>>> machine functions and machine basic blocks, and I think that an intrusive
>>>>> implementation will work best for data structures like these. After that
>>>>> I will continue adding support for serialization of the remaining data
>>>>> structures.
>>>>>
>>>>>
>>>>>
>>>>> Thanks for reading through the proposal. What are your thoughts about
>>>>> this format?
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> LLVM Developers mailing list
>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>
>