[LLVMdev] RFC: Machine Level IR text-based serialization format
Sean Silva
chisophugis at gmail.com
Tue Apr 28 17:36:57 PDT 2015
On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer <david.majnemer at gmail.com>
wrote:
> I love the idea of having some sort of textual representation. My only
> concern is that our YAML parser is not very actively maintained (is there
> someone expert with its implementation *and* active in the project?) and
> (IMHO) over-engineered when compared to the simplicity of our custom IR
> parser.
>
> Without TLC, I'm afraid it would make for a poor piece of LLVM
> infrastructure to rely on. The reliability of the serialization mechanism
> is very important if we are to have any chance of applying fuzz testing to
> the backend pieces; after all, testability is a huge motivation for this
> work.
>
> As a concrete example, a file solely containing '%' crashes the yaml
> parser:
> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool
> llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() && "Root
> is NULL iff parsing failed"' failed.
> 0 yaml2obj 0x000000000048682e
> 1 yaml2obj 0x0000000000486b43
> 2 yaml2obj 0x000000000048570e
> 3 libpthread.so.0 0x00007f5e79643340
> 4 libc.so.6 0x00007f5e78c9acc9 gsignal + 57
> 5 libc.so.6 0x00007f5e78c9e0d8 abort + 328
> 6 libc.so.6 0x00007f5e78c93b86
> 7 libc.so.6 0x00007f5e78c93c32
> 8 yaml2obj 0x000000000045f378
> 9 yaml2obj 0x000000000040d4b3
> 10 yaml2obj 0x000000000040b0fa
> 11 yaml2obj 0x0000000000404a79
> 12 yaml2obj 0x0000000000404dd8
> 13 libc.so.6 0x00007f5e78c85ec5 __libc_start_main + 245
> 14 yaml2obj 0x0000000000404879
> Stack dump:
> 0. Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff
> t.yaml
>
>
Hopefully a fuzzer that is fuzzing YAML input would not waste its time
with syntactically invalid or unusual YAML.
Also, you're thinking of YAMLIO which is a layer on top of the YAML parser
(YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it is good for
some types of data, not for all) but still use the YAML parser.
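
To make the fuzzing point a bit more concrete, here is a minimal sketch in
Python (not LLVM code) of a generator that only emits block-style YAML that
is intended to be well formed, so a fuzzer built on something like it would
exercise the layers above the parser instead of the parser's error paths.
The names random_value, quote, emit and generate_document are made up for
this illustration.

```python
import random


def random_value(rng, depth=0):
    """Build a random nested structure of dicts, lists and scalar strings."""
    kind = rng.choice(["scalar", "map", "seq"]) if depth < 2 else "scalar"
    if kind == "map":
        return {f"key{i}": random_value(rng, depth + 1)
                for i in range(rng.randint(1, 3))}
    if kind == "seq":
        return [random_value(rng, depth + 1) for _ in range(rng.randint(1, 3))]
    # Scalars may contain characters that are special in YAML, such as '%'.
    return "".join(rng.choice("abc%#: ") for _ in range(rng.randint(1, 6)))


def quote(text):
    """Single-quote a scalar; inside single quotes only "'" needs escaping."""
    return "'" + text.replace("'", "''") + "'"


def emit(value, indent=0):
    """Return block-style YAML lines for a nested dict/list/str structure."""
    pad = " " * indent
    lines = []
    if isinstance(value, dict):
        for key, child in value.items():
            if isinstance(child, (dict, list)):
                lines.append(f"{pad}{key}:")
                lines.extend(emit(child, indent + 2))
            else:
                lines.append(f"{pad}{key}: {quote(child)}")
    elif isinstance(value, list):
        for child in value:
            if isinstance(child, (dict, list)):
                lines.append(f"{pad}-")
                lines.extend(emit(child, indent + 2))
            else:
                lines.append(f"{pad}- {quote(child)}")
    else:
        lines.append(pad + quote(value))
    return lines


def generate_document(seed):
    """Generate one YAML document, reproducibly, from an integer seed."""
    rng = random.Random(seed)
    body = emit(random_value(rng))
    return "\n".join(["---"] + body + ["..."]) + "\n"


if __name__ == "__main__":
    print(generate_document(1), end="")
```

Every scalar is quoted, so a bare '%' can show up as data but never as
stray syntax.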
-- Sean Silva
> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
>
>>
>>
>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
>>
>>> Hi Alex,
>>>
>>> Thanks for working on this.
>>>
>>> Personally I would rather not have to write YAML inputs but instead
>>> rely on what the machine dumps look like. That being said, I can live
>>> with YAML :).
>>>
>>> More importantly, how do you plan to report syntax errors to the users?
>>> Things like invalid instruction, invalid registers, etc.?
>>> What about unallocated code, i.e., virtual registers, invalid SSA form,
>>> etc.?
>>>
>>> Cheers,
>>> Q.
>>>
>>
>> Thanks,
>>
>> Unfortunately, the machine dumps are quite incomplete (and tricky to
>> parse too!), and thus some sort of new syntax has to be developed.
>> I think that a YAML based container is a good candidate for this purpose,
>> as it has a structured format that represents things like machine functions,
>> frame information, register information, target specific machine function
>> details, etc in a clear and readable way.
>>
>> I haven't thought about error reporting that much, as I've been mostly
>> working on developing the syntax and making sure that all the data
>> structures can be represented by it. But I believe that the errors that
>> crop up in an invalid machine instruction syntax, like invalid basic block
>> references, invalid instructions, etc. can be reported quite well and I
>> can rely on already existing error reporting facilities in LLVM to help
>> me. The more structural errors, like missing attributes, will be handled
>> by the YAML parser automatically, and I might extend it to provide
>> better/more specific error messages. And I think that it's possible to
>> use the machine verifier to catch the other errors that you've mentioned.
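
As a rough illustration of the kind of error reporting discussed above, here
is a small Python sketch (not LLVM code, and not the proposed implementation)
that checks instruction strings shaped like the ones in the proposal's example
and reports the column of the offending token. The opcode, register and flag
tables are tiny hypothetical subsets, and parse_instruction and
InstructionError are names invented for this sketch; the real instruction
syntax had not been settled in this thread.

```python
import re

# Hypothetical subsets, taken from the instruction strings in the example.
KNOWN_OPCODES = {"push64r", "mov32mr", "mov32rm"}
KNOWN_REGISTERS = {"rax", "rsp", "edi", "noreg"}
KNOWN_FLAGS = {"undef"}


class InstructionError(Exception):
    """A syntax error that remembers where in the string it happened."""

    def __init__(self, message, text, column):
        super().__init__(message)
        self.message, self.text, self.column = message, text, column

    def render(self):
        # Print the message, the source line, and a caret under the column.
        return f"error: {self.message}\n  {self.text}\n  {' ' * self.column}^"


def tokenize(text):
    """Split on whitespace and commas, keeping each token's column."""
    return [(m.group(), m.start()) for m in re.finditer(r"[^\s,]+", text)]


def check_operand(token, column, text):
    if token.startswith("%"):
        if token[1:] not in KNOWN_REGISTERS:
            raise InstructionError(f"unknown register '{token}'", text, column)
    elif not re.fullmatch(r"-?\d+", token):
        raise InstructionError(f"invalid operand '{token}'", text, column)


def parse_instruction(text):
    tokens = tokenize(text)
    defs = []
    # Optional "%reg = " prefix naming the defined register.
    if len(tokens) >= 2 and tokens[1][0] == "=":
        name, column = tokens[0]
        if not name.startswith("%"):
            raise InstructionError("expected a register before '='", text, column)
        check_operand(name, column, text)
        defs.append(name)
        tokens = tokens[2:]
    if not tokens:
        raise InstructionError("expected an opcode", text, len(text))
    opcode, column = tokens[0]
    if opcode not in KNOWN_OPCODES:
        raise InstructionError(f"unknown instruction '{opcode}'", text, column)
    operands, flags = [], []
    for token, column in tokens[1:]:
        if token in KNOWN_FLAGS:
            flags.append(token)  # a flag applies to the next operand
            continue
        check_operand(token, column, text)
        operands.append(" ".join(flags + [token]))
        flags = []
    if flags:
        raise InstructionError(f"flag '{flags[-1]}' has no operand", text, len(text))
    return {"defs": defs, "opcode": opcode, "operands": operands}


if __name__ == "__main__":
    try:
        parse_instruction("mov32rm %rsp, 1, %bogus")
    except InstructionError as err:
        print(err.render())
```

Checks that need whole-function context, such as invalid SSA form, are out of
scope for a per-instruction check like this one.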
>>
>> Alex
>>
>>
>>
>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>>
>>> I would like to propose a text-based, human readable format that will be used to
>>> serialize the machine level IR. The major goal of this format is to allow LLVM
>>> to save the machine level IR after any code generation pass and then to load
>>> it again and continue running passes on the machine level IR. The primary use case
>>> of this format is to enable an easier testing process for the code generation passes,
>>> by allowing the developers to write tests that load the IR, then invoke just a
>>> specific code gen pass and then inspect the output of that pass by checking the
>>> printed out IR.
>>>
>>>
>>>
>>> The proposed format has a number of key features:
>>>
>>> - It stores the machine level IR and the optional LLVM IR in one text file.
>>> - The connections between the machine level IR and the LLVM IR are preserved.
>>> - The format uses a YAML based container for most of the data structures. The LLVM
>>>   IR is embedded in the YAML container.
>>> - The format also uses a new, text-based syntax to serialize the machine instructions.
>>>   The instructions are embedded in YAML.
>>>
>>>
>>> This is an incomplete example of a YAML file containing the LLVM IR, the machine level IR
>>> and the instructions:
>>>
>>>
>>> ---
>>> ir: |
>>>   define i32 @fact(i32 %n) {
>>>     %1 = alloca i32, align 4
>>>     store i32 %n, i32* %1, align 4
>>>     %2 = load i32, i32* %1, align 4
>>>     %3 = icmp eq i32 %2, 0
>>>     br i1 %3, label %10, label %4
>>>
>>>   ; <label>:4 ; preds = %0
>>>     %5 = load i32, i32* %1, align 4
>>>     %6 = sub nsw i32 %5, 1
>>>     %7 = call i32 @fact(i32 %6)
>>>     %8 = load i32, i32* %1, align 4
>>>     %9 = mul nsw i32 %7, %8
>>>     br label %10
>>>
>>>   ; <label>:10 ; preds = %0, %4
>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
>>>     ret i32 %11
>>>   }
>>>
>>> ...
>>> ---
>>> number: 0
>>> name: fact
>>> alignment: 4
>>> regInfo:
>>>   ....
>>> frameInfo:
>>>   ....
>>> body:
>>>   - bb: 0
>>>     llbb: '%0'
>>>     successors: [ 'bb#2', 'bb#1' ]
>>>     liveIns: [ '%edi' ]
>>>     instructions:
>>>       - 'push64r undef %rax, %rsp, %rsp'
>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
>>>       - ....
>>>     ....
>>>   - bb: 1
>>>     llbb: '%4'
>>>     successors: [ 'bb#2' ]
>>>     instructions:
>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
>>>       - ....
>>>     ....
>>>   - ....
>>> ....
>>> ...
>>>
>>>
>>> The example above shows a YAML file with two YAML documents (delimited by `---`
>>> and `...`) containing the LLVM IR and the machine function information for the function `fact`.
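
For readers less familiar with multi-document YAML, the two-document layout
described above can be sketched in a few lines of Python. This is only an
illustration of how `---` and `...` at the start of a line delimit documents;
split_documents is a hypothetical helper, not part of LLVM, and a real
consumer would use a proper YAML parser instead.

```python
def split_documents(text):
    """Split a YAML stream into the raw text of each document.

    A line that is exactly '---' starts a document and a line that is
    exactly '...' ends one. Indented lines and elisions such as '....'
    are treated as ordinary content.
    """
    documents, current, inside = [], [], False
    for line in text.splitlines():
        marker = line.rstrip()
        if marker == "---":
            if inside:
                # A new '---' also closes a document that was left open.
                documents.append("\n".join(current))
            current, inside = [], True
        elif marker == "...":
            if inside:
                documents.append("\n".join(current))
            current, inside = [], False
        elif inside:
            current.append(line)
    if inside:
        documents.append("\n".join(current))
    return documents


if __name__ == "__main__":
    sample = (
        "---\n"
        "ir: |\n"
        "  define i32 @fact(i32 %n) {\n"
        "  }\n"
        "...\n"
        "---\n"
        "number: 0\n"
        "name: fact\n"
        "...\n"
    )
    for index, document in enumerate(split_documents(sample)):
        print(f"document {index}:")
        print(document)
```

With the sample input, the first document holds the `ir` block and the second
holds the machine function attributes.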
>>>
>>>
>>>
>>> When a specific format is chosen, I'll start with patches that serialize the
>>> embedded LLVM IR. Then I'll add support for things like machine functions and
>>> machine basic blocks, and I think that an intrusive implementation will work best
>>> for data structures like these. After that I will continue adding support for
>>> serialization of the remaining data structures.
>>>
>>>
>>>
>>> Thanks for reading through the proposal. What are your thoughts about this format?
>>>
>>>
>>> _______________________________________________
>>> LLVM Developers mailing list
>>> LLVMdev at cs.uiuc.edu http://llvm.cs.uiuc.edu
>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>
>>>
>>>