[LLVMdev] RFC: Machine Level IR text-based serialization format

Tue Apr 28 22:52:57 PDT 2015

There's no reason to rewrite the IR parser.

-eric

On Tue, Apr 28, 2015, 10:39 PM Hayden Livingston <halivingston at gmail.com>
wrote:

> As an aside, you haven't mentioned but will the IR parser be rewritten
> at all? Is the YAML a container on top of the IR?
>
> If you are rewriting the IR parser, would it be possible to maintain
> some sort of grammar?
>
> On Tue, Apr 28, 2015 at 5:59 PM, David Majnemer
> <david.majnemer at gmail.com> wrote:
> >
> >
> > On Tuesday, April 28, 2015, Sean Silva <chisophugis at gmail.com> wrote:
> >>
> >>
> >>
> >> On Tue, Apr 28, 2015 at 3:51 PM, David Majnemer
> >> <david.majnemer at gmail.com> wrote:
> >>>
> >>> I love the idea of having some sort of textual representation.  My only
> >>> concern is that our YAML parser is not very actively maintained (is there
> >>> someone expert with its implementation *and* active in the project?) and
> >>> (IMHO) over-engineered when compared to the simplicity of our custom IR
> >>> parser.
> >>>
> >>> Without TLC, I'm afraid it would make for a poor piece of LLVM
> >>> infrastructure to rely on.  The reliability of the serialization
> >>> mechanism is very important if we are to have any chance of applying
> >>> fuzz testing to the backend pieces; after all, testability is a huge
> >>> motivation for this work.
> >>>
> >>> As a concrete example, a file solely containing '%' crashes the yaml
> >>> parser:
> >>> $ ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
> >>> yaml2obj: ~/llvm/src/lib/Support/YAMLTraits.cpp:78: bool
> >>> llvm::yaml::Input::setCurrentDocument(): Assertion `Strm->failed() &&
> >>> "Root is NULL iff parsing failed"' failed.
> >>> 0  yaml2obj        0x000000000048682e
> >>> 1  yaml2obj        0x0000000000486b43
> >>> 2  yaml2obj        0x000000000048570e
> >>> 3  libpthread.so.0 0x00007f5e79643340
> >>> 4  libc.so.6       0x00007f5e78c9acc9 gsignal + 57
> >>> 5  libc.so.6       0x00007f5e78c9e0d8 abort + 328
> >>> 6  libc.so.6       0x00007f5e78c93b86
> >>> 7  libc.so.6       0x00007f5e78c93c32
> >>> 8  yaml2obj        0x000000000045f378
> >>> 9  yaml2obj        0x000000000040d4b3
> >>> 10 yaml2obj        0x000000000040b0fa
> >>> 11 yaml2obj        0x0000000000404a79
> >>> 12 yaml2obj        0x0000000000404dd8
> >>> 13 libc.so.6       0x00007f5e78c85ec5 __libc_start_main + 245
> >>> 14 yaml2obj        0x0000000000404879
> >>> Stack dump:
> >>> 0.      Program arguments: ~/llvm/Debug+Asserts/bin/yaml2obj -format=coff t.yaml
> >>>
> >>
> >>
> >> Hopefully a fuzzer that is fuzzing a yaml input would not waste its time
> >> with syntactically invalid or unusual YAML.
> >
> >
> > Maybe.  I don't see why we would want to lock ourselves out of using
> > afl-fuzz though.
> >
> >>
> >>
> >> Also, you're thinking of YAMLIO which is a layer on top of the YAML
> >> parser (YAMLParser.{h,cpp}). It might make sense to not use YAMLIO (it
> >> is good for some types of data, not for all) but still use the YAML
> >> parser.
> >>
> >> -- Sean Silva
> >>
> >>>
> >>> On Tue, Apr 28, 2015 at 2:00 PM, Alex L <arphaman at gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> 2015-04-28 10:14 GMT-07:00 Quentin Colombet <qcolombet at apple.com>:
> >>>>>
> >>>>> Hi Alex,
> >>>>>
> >>>>> Thanks for working on this.
> >>>>>
> >>>>> Personally I would rather not have to write YAML inputs but instead
> >>>>> rely on what the machine dumps look like. That being said, I can live
> >>>>> with YAML :).
> >>>>>
> >>>>> More importantly, how do you plan to report syntax errors to the
> >>>>> users? Things like invalid instruction, invalid registers, etc.?
> >>>>> What about unallocated code, i.e., virtual registers, invalid SSA
> >>>>> form, etc.?
> >>>>>
> >>>>> Cheers,
> >>>>> Q.
> >>>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Unfortunately, the machine dumps are quite incomplete (and tricky to
> >>>> parse too!), and thus some sort of new syntax has to be developed.
> >>>> I think that a YAML based container is a good candidate for this
> >>>> purpose, as it has a structured format that represents things like
> >>>> machine functions, frame information, register information, target
> >>>> specific machine function details, etc. in a clear and readable way.
> >>>>
> >>>> I haven't thought about error reporting that much, as I've been mostly
> >>>> working on developing the syntax and making sure that all the data
> >>>> structures can be represented by it. But I believe that the errors
> >>>> that crop up in an invalid machine instruction syntax, like invalid
> >>>> basic block references, invalid instructions, etc. can be reported
> >>>> quite well and I can rely on already existing error reporting
> >>>> facilities in LLVM to help me. The more structural errors, like
> >>>> missing attributes will be handled by the YAML parser automatically,
> >>>> and I might extend it to provide better/more specific error messages.
> >>>> And I think that it's possible to use the machine verifier to catch
> >>>> the other errors that you've mentioned.
> >>>>
> >>>> Alex
> >>>>
> >>>>
> >>>>>
> >>>>> On Apr 28, 2015, at 9:56 AM, Alex L <arphaman at gmail.com> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>>
> >>>>> I would like to propose a text-based, human readable format that will
> >>>>> be used to serialize the machine level IR. The major goal of this
> >>>>> format is to allow LLVM to save the machine level IR after any code
> >>>>> generation pass and then to load it again and continue running passes
> >>>>> on the machine level IR. The primary use case of this format is to
> >>>>> enable an easier testing process for the code generation passes, by
> >>>>> allowing the developers to write tests that load the IR, then invoke
> >>>>> just a specific code gen pass and then inspect the output of that
> >>>>> pass by checking the printed out IR.
> >>>>>
> >>>>>
> >>>>>
> >>>>> The proposed format has a number of key features:
> >>>>>
> >>>>> - It stores the machine level IR and the optional LLVM IR in one text
> >>>>>   file.
> >>>>> - The connections between the machine level IR and the LLVM IR are
> >>>>>   preserved.
> >>>>> - The format uses a YAML based container for most of the data
> >>>>>   structures. The LLVM IR is embedded in the YAML container.
> >>>>> - The format also uses a new, text-based syntax to serialize the
> >>>>>   machine instructions. The instructions are embedded in YAML.
> >>>>>
> >>>>>
> >>>>> This is an incomplete example of a YAML file containing the LLVM IR,
> >>>>> the machine level IR and the instructions:
> >>>>>
> >>>>>
> >>>>> ---
> >>>>> ir: |
> >>>>>   define i32 @fact(i32 %n) {
> >>>>>     %1 = alloca i32, align 4
> >>>>>     store i32 %n, i32* %1, align 4
> >>>>>     %2 = load i32, i32* %1, align 4
> >>>>>     %3 = icmp eq i32 %2, 0
> >>>>>     br i1 %3, label %10, label %4
> >>>>>
> >>>>>   ; <label>:4                                       ; preds = %0
> >>>>>     %5 = load i32, i32* %1, align 4
> >>>>>     %6 = sub nsw i32 %5, 1
> >>>>>     %7 = call i32 @fact(i32 %6)
> >>>>>     %8 = load i32, i32* %1, align 4
> >>>>>     %9 = mul nsw i32 %7, %8
> >>>>>     br label %10
> >>>>>
> >>>>>   ; <label>:10                                      ; preds = %0, %4
> >>>>>     %11 = phi i32 [ %9, %4 ], [ 1, %0 ]
> >>>>>     ret i32 %11
> >>>>>   }
> >>>>>
> >>>>> ...
> >>>>> ---
> >>>>> number:          0
> >>>>> name:            fact
> >>>>> alignment:       4
> >>>>> regInfo:
> >>>>>   ....
> >>>>> frameInfo:
> >>>>>   ....
> >>>>> body:
> >>>>>   - bb:              0
> >>>>>     llbb:            '%0'
> >>>>>     successors:      [ 'bb#2', 'bb#1' ]
> >>>>>     liveIns:         [ '%edi' ]
> >>>>>     instructions:
> >>>>>       - 'push64r undef %rax, %rsp, %rsp'
> >>>>>       - 'mov32mr %rsp, 1, %noreg, 4, %noreg, %edi'
> >>>>>       - ....
> >>>>>         ....
> >>>>>   - bb:              1
> >>>>>     llbb:            '%4'
> >>>>>     successors:      [ 'bb#2' ]
> >>>>>     instructions:
> >>>>>       - '%edi = mov32rm %rsp, 1, %noreg, 4, %noreg'
> >>>>>       - ....
> >>>>>         ....
> >>>>>   - ....
> >>>>>     ....
> >>>>> ...
> >>>>>
> >>>>>
> >>>>> The example above shows a YAML file with two YAML documents (delimited
> >>>>> by `---` and `...`) containing the LLVM IR and the machine function
> >>>>> information for the function `fact`.
> >>>>>
> >>>>>
> >>>>>
> >>>>> When a specific format is chosen, I'll start with patches that
> >>>>> serialize the embedded LLVM IR. Then I'll add support for things like
> >>>>> machine functions and machine basic blocks, and I think that an
> >>>>> intrusive implementation will work best for data structures like
> >>>>> these. After that I will continue adding support for serialization of
> >>>>> the remaining data structures.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks for reading through the proposal. What are your thoughts about
> >>>>> this format?
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> LLVM Developers mailing list
> >>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
> >>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>
> >
> >
>
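
For anyone skimming the quoted proposal: the container layout it describes (one
YAML document holding the LLVM IR, a second one holding the machine function,
delimited by `---` and `...`) can be pulled apart with very little code. The
following is only a rough sketch under stated assumptions, not the proposed
implementation and not a YAML parser. The names split_documents,
top_level_keys and SAMPLE are hypothetical, the sample input is an abbreviated
copy of the quoted example (the body of @fact is shortened), and the code only
understands the two document markers and unindented `key:` lines.

```python
def split_documents(text):
    """Split a YAML stream into documents on the `---` / `...` markers.

    Assumption: markers appear alone on an unindented line, as in the
    quoted example. Anything outside a document is ignored.
    """
    docs = []
    current = None  # None means we are between documents
    for line in text.splitlines():
        stripped = line.rstrip()
        if stripped == "---":
            # A new document starts; close the previous one if still open.
            if current is not None:
                docs.append("\n".join(current))
            current = []
        elif stripped == "...":
            # Explicit end-of-document marker.
            if current is not None:
                docs.append("\n".join(current))
            current = None
        elif current is not None:
            current.append(line)
    if current is not None:
        docs.append("\n".join(current))
    return docs


def top_level_keys(doc):
    """Return the unindented `key:` names of one document, in order."""
    keys = []
    for line in doc.splitlines():
        if line and not line[0].isspace() and ":" in line:
            keys.append(line.split(":", 1)[0])
    return keys


# Abbreviated copy of the quoted example (function body shortened).
SAMPLE = """\
---
ir: |
  define i32 @fact(i32 %n) {
    ret i32 %n
  }
...
---
number:          0
name:            fact
alignment:       4
body:
  - bb:              0
    instructions:
      - 'push64r undef %rax, %rsp, %rsp'
...
"""

if __name__ == "__main__":
    for index, doc in enumerate(split_documents(SAMPLE)):
        print(index, top_level_keys(doc))
```

The instruction strings stay opaque to this sketch on purpose: per the
proposal they use their own text-based syntax and would need a separate parser.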