[llvm-dev] [RFC] Implementing LLVM MC Protobuf Fuzzer for Assembly and Encoding for RISC-V target

Thu Dec 20 13:14:44 PST 2018

Hi Daniel,

Thanks for the feedback.

Sorry for the delay in responding, other tasks got in the way.

I was able to experiment a bit with your suggestion. The WIP is here:

https://reviews.llvm.org/D55644 Emitting MCInstruction from Protocol 
Buffer Message.

I modified the example fuzzer and the “unconstrained fuzzer” to dump 
both the MCInst and assembly instruction based on
the operand types defined in the Protocol Buffer messages.

Just to recap, the “unconstrained fuzzer” means valid instruction 
operand types are defined in the Protocol Buffer messages –
they can be register, immediate, immediate + register pair etc.
Valid instruction opcodes are also defined in the Protocol Buffer 
messages.
The fuzzer then generates an assembly statement as a combo of any opcode 
and operands types.
This is the version of the fuzzer that proved to be useful in finding 
bugs.
It can generate both valid and invalid instructions like “add”, “add 
x0”, add “x1, x2, x3”, add “f1, x2, 10”, etc.

Here are some observations from modifying the assembler fuzzer to emit 
both assembly and MCInst:

1) Emitting MC instructions requires access to a backend’s MC-layer 
internal values.
For example, we need the symbolic names and enum values for instruction 
opcodes and registers.
So we depend on *.td files (e.g., RISCVGenInstrInfo, 
RISCVGenRegisterInfo) to print the MCInst opcode and operand values 
correctly.
Example:
   add x8
# <MCInst #168 ADD
#  <MCOperand Reg:9>>

For the assembly and disassembler fuzzers we created, we only relied on 
the ISA manual to create the Protocol Buffer messages.
But if we now additionally emit MCInst, we will need to add that 
dependence.

2) It seems it is possible to emit MCInst operands based on the protocol 
buffer messages we defined for valid operand types.
However, at the time we emit an assembly instruction from the protocol 
buffer messages, we don’t know yet if the instruction is
valid and will be parsed, but we still print the expected MCInst 
operands.
For example: see the invalid add with only one operand “add x8” above. 
We still print the MCInst even though it is not a
valid instruction.
The post-processing tool that will compare MCInst generated by the 
fuzzer with what the assembler outputs,
will have to handle/discard this situation.

3) Need to add more protocol messages to encode more info needed by the 
fuzzer that emits MCInst.
For example, it is common that FP register name are the same (F1, F2, 
etc) in SP and DP instructions (e.g, fadd.s, fadd.d),
but they have different MC symbolic names and values (F1_32, F1_64, 
F2_32, F2_64, etc).
The proto-to-asm converter does not know if it is generating a 32-bit, 
64-bit, SP or DP assembly instruction,
it does not need to care about that.
But the proto-to-MCInstr converter needs that kind info. However it does 
not have any context since it prints operand by operand,
it does not know which opcode was processed before, or which target 
triple or ISA extension was intended.
Issues like this probably can be addressed by adding/changing the 
protocol messages
(e.g., Allow register operand type to be one of FP or Integer type, 
allow FP register be one of SP or DP type.
The proto-to-asm converter only needs the first info, the 
proto-to-MCInst converter needs both info).

There a more scenarios like (3) that I came across when trying to emit 
MCInsts from the assembler fuzzer. It seems they have solutions tough.

So, in principle, you can add proto-to-asm, proto-to-MCInst (and 
proto-to-encoding) converters based on the Protocol Messages.

The only downside I see is that the Protocol Messages definitions will 
depend on a backend's MC layer details (to print MCInsts correctly),
while before we defined the Protocol Buffer Messages based on the ISA 
manual only.

I have not tried to do the same with the disassembler fuzzer, because in 
that one we just generate random 32 bit numbers, varying some non-fixed 
fields  in the number that represent operands and opcodes.

Let me know what you think.

Thanks,
Ana.

On 2018-10-24 18:24, Daniel Sanders wrote:
> Hi Ana,
> 
>> On Oct 24, 2018, at 13:05, apazos at codeaurora.org wrote:
>> 
>> Hi Daniel,
>> 
>> Thanks for the feedback.
>> 
>> That is correct, you can invoke the fuzzers without a golden
>> reference implementation. The driver program to compare behaviors is
>> just a convenient tool for those who have a reference
>> implementation.
>> 
>> I am not sure I understood your suggestion about de-serializing
>> Protobuf messages as MCInst objects. Can you clarify?
> 
> I'm thinking that a single protobuf message could be used to generate
> the Assembly, MCInst, and Encoding for a given instruction. From that
> you can:
> 
> 	* Parse the generated assembly and check the intermediate MCInst
> matches the generated MCInst
> 	* Disassemble the generated encoding and check the intermediate
> MCInst matches the generated MCInst
> 	* Assemble the generated MCInst and check the encoding matches the
> generated encoding
> 	* Disassemble the generated MCInst and check the assembly matches the
> generated assembly
> 
> in addition to testing the Assembly->Encoding and Encoding->Assembly.
> With these additional tests, the fuzzer can identify whether it's the
> parsing, assembling, disassembling, or instruction printing that is
> incorrect.
> 
> For example, given 'addiu $1, $2, 3', 'ADD reg(1), reg(2), imm(3)',
> and 0x0123 the fuzzer would be able to report something like:
>  0x0123 didn't match golden reference (0x0123 != 0x0126)
>  Parsing  'addiu $1, $2, 3' produces 'ADD reg(1), reg(2), imm(3)':
> Parser is ok
>  Assembling 'ADD reg(1), reg(2), imm(3)' produces '0x0126': Assembler
> is wrong
>  Disassembling '0x0123' produces 'ADD reg(1), reg(2), imm(3)' produces
> '0x0124': Disassembler is ok
>  Printing 'ADD reg(1), reg(2), imm(3)' produces 'addiu $1, $2, 3':
> Instruction printer is ok
> 
> I'm also thinking that fuzzing the MCInst->Assembly, and
> MCInst->Encoding paths in isolation would allow us to be sure that the
> MCInst's that are emitted by CodeGen behave correctly when given to
> the MC layer. At the moment, we assume that Assembly->MCInst->Encoding
> and Encoding->MCInst->Assembly being correct implies that
> CodeGen->MCInst->Assembly and CodeGen->MCInst->Encoding are correct
> too. However, this doesn't quite hold as there are multiple ways of
> expressing the same operand in an MCInst. For example:
> 
> 	* addExpr(MCConstantExpr::create(2, Ctx))
> 	* addImm(2)
> 	* addExpr(/* some REL relocation evaluating to Symbol+2 */)
> 
> and Assembly->MCInst->Encoding/Encoding->MCInst->Assembly will only
> exercise one of them.
> 
>> Thanks,
>> Ana.
>> 
>> On 2018-10-16 10:34, Daniel Sanders wrote:
>> On 16 Oct 2018, at 10:09, Daniel Sanders via llvm-dev
>> <llvm-dev at lists.llvm.org> wrote:
>> Hi Ana,
>> I think this looks interesting although unfortunately I'm not sure
>> I'm going to be able to make use of it for my current target as I
>> don't have a golden reference tool available.
>> Thinking about it a bit more, the lack of a golden reference only
>> really affects my ability to use the driver script. With a
>> different/modified driver I should be able to use the underlying
>> fuzzer without a reference tool available.
>> One of the key weaknesses of llvm-mc-disassembler-fuzzer for most
>> targets is that it only finds a corpus of tests that improve
>> coverage but doesn't provide any assessment on what the correct
>> behaviour is. A human is required to make proper test cases out of
>> the corpus and feed it back in so the fuzzer can drop the
>> corresponding generated tests. Having a fuzzer that can verify the
>> behaviour as well would be very useful for targets with access to a
>> golden reference tool.
>> One thing that occurred to me while skimming through D51144 was that
>> something similar to proto_to_asm_main.cpp could be used to generate
>> MCInst objects directly from the same protobuf. This would allow you
>> to attribute bugs to the parser, instruction printer, or object
>> emitter since you'd be able to tell, for example, that the parser
>> emitted the an MCInst that matched the one expected by the protobuf.
>> On 15 Oct 2018, at 12:29, via llvm-dev <llvm-dev at lists.llvm.org>
>> wrote:
>> Hello,
>> We have implemented LLVM Machine Code Protobuf fuzzers for the
>> RISC-V target as part of a Summer internship project with our intern
>> Jocelyn Wei.
>> The fuzzers for the assembler and disassembler proved to be useful.
>> We uncovered bugs and detected compatibility issues with other
>> tools, e.g., by running a driver program that implements a round
>> trip with a golden (i.e., more tested) tool such as GNU AS.
>> We built different fuzzer versions to experiment with the level of
>> fuzzing for the instruction operands.
>> The versions are labeled sample, semi-constrained, unconstrained. We
>> fix opcodes, and depending on the fuzzer version, allow number of
>> operands, operand value ranges, and operand types to vary.
>> The code is available for review:
>> https://reviews.llvm.org/D51710 Implemented Protobuf fuzzer for LLVM
>> RISC-V MC Disassembler
>> https://reviews.llvm.org/D51144 Implemented Protobuf fuzzer for LLVM
>> RISC-V MC Assembler
>> We would like to assess people's interest in adding this type of
>> tool to the LLVM code base.
>> It can be further improved for RISC-V target and also expanded to
>> other targets.
>> We have a Poster about the fuzzers at the LLVM Dev Conf this week.
>> Please visit our poster and come by with your comments and
>> suggestions. We appreciate your feebdack.
>> Thank you,
>> Ana.
>> --
>> Ana Pazos
>> Qualcomm Innovation Center, Inc.
>> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
>> a Linux Foundation Collaborative Project.
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
>> _______________________________________________
>> LLVM Developers mailing list
>> llvm-dev at lists.llvm.org
>> http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-dev
> 
> --
> Ana Pazos
> Qualcomm Innovation Center, Inc.
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project.

-- 
Ana Pazos
Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project.