[llvm-dev] [RFC] Compiled regression tests.

Robinson, Paul via llvm-dev llvm-dev at lists.llvm.org
Wed Jul 1 10:24:33 PDT 2020


What I actually meant re. CHECK-DAG is to take this

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_INNER:[0-9]+]], ![[ACCESS_GROUP_OUTER:[0-9]+]]}
    73 ; CHECK: ![[ACCESS_GROUP_INNER]] = distinct !{}
    74 ; CHECK: ![[ACCESS_GROUP_OUTER]] = distinct !{}

and turn it into this

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_INNER:[0-9]+]], ![[ACCESS_GROUP_OUTER:[0-9]+]]}
    73 ; CHECK-DAG: ![[ACCESS_GROUP_INNER]] = distinct !{}
    74 ; CHECK-DAG: ![[ACCESS_GROUP_OUTER]] = distinct !{}

except that I think I was misled by the “non-semantic” remark about the change.  Did you mean that the order of the INNER and OUTER elements (line 72) has been swapped?  That sounds semantic, as far as the structure of the metadata is concerned!  But okay, let’s call that a syntactic change, and a test relying on the order of the parameters will break.  Which it did.  And the correct fix is instead

    72 ; CHECK: ![[ACCESS_GROUP_LIST_3]] = !{![[ACCESS_GROUP_OUTER:[0-9]+]], ![[ACCESS_GROUP_INNER:[0-9]+]]}

is it not?  To reflect the change in order?

But let’s say I’m the one doing this presumably innocuous change, and have no clue what I’m doing, and don’t know much about how FileCheck works (which is pretty typical of the community, I admit).  You’ve shown issues with trying to diagnose the FileCheck results.

How would the compiled regression test fail?  How would it be easier to identify and repair the issue?


Re. how MIR makes testing easier: it is a serialization of the data that a machine-IR pass operates on, which makes it feasible to feed canned MIR through a single pass in llc and look at what exactly that pass did.  Prior to MIR, we had to go from IR to real machine code and infer what was going on in a pass after multiple levels of transformation had occurred.  It was very black-box, and the black box was way too big.
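
For reference, a minimal sketch of what such a test can look like; the pass name, triple, and instructions below are chosen purely for illustration and are not taken from an actual test in the tree:

    # RUN: llc -mtriple=x86_64-- -run-pass=dead-mi-elimination -o - %s | FileCheck %s
    # %0 is defined but never used, so the pass should delete the MOV32ri.
    ---
    name:            foo
    body:            |
      bb.0:
        %0:gr32 = MOV32ri 42
        RETQ
    ...
    # CHECK-LABEL: name: foo
    # CHECK-NOT:   MOV32ri
    # CHECK:       RETQ

Only the single pass named on the RUN line runs, so the CHECK lines describe that pass’s effect directly instead of the end of a long pipeline.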


Anyway, getting back to the original topic, it sounds like you’re proposing two rather independent things: (1) an easier API for writing optimizer unittests, (2) a way to push unittests into the main llvm/test tree.  Can we separate these things?
Thanks,
--paulr


From: Michael Kruse <llvmdev at meinersbur.de>
Sent: Wednesday, July 1, 2020 12:33 PM
To: Robinson, Paul <paul.robinson at sony.com>
Cc: Michael Kruse <llvmdev at meinersbur.de>; llvm-dev at lists.llvm.org
Subject: Re: [llvm-dev] [RFC] Compiled regression tests.

On Wed, Jul 1, 2020 at 10:18 AM Robinson, Paul <paul.robinson at sony.com> wrote:
The test as written is fragile because it requires a certain ordering.  If the output order is not important, use CHECK-DAG rather than CHECK.  This would be a failure to understand the testing tool.

CHECK-DAG does not help here, since what changes is within a list on a single line, and we have no CHECK-SAME-DAG or CHECK-DAG-SAME. Even if we had it, the line that actually changed is textually the same, and FileCheck would need to backtrack deep into the following lines to try alternative placeholder substitutions. It would look like

    CHECK-SAME-DAG: ![[ACCESS_GROUP_INNER:[0-9]+]]
    CHECK-SAME-DAG: ,
    CHECK-SAME-DAG: ![[ACCESS_GROUP_OUTER:[0-9]+]]

which would allow the comma to appear anywhere, and which I don't find readable.

My (naive?) conclusion is that textual checking is not the right tool here.


My experience, over a 40-year career, is that good software developers are generally not very good test-writers.  These are different skills and good testing is frequently not taught.  It’s easy to write fragile tests; you make your patch, you see what comes out, and you write the test to expect exactly that output, using the minimum possible features of the testing tool.  This is poor technique all around.  We even have scripts that automatically generate such tests, used primarily in codegen tests.  I devoutly hope that the people who produce those tests responsibly eyeball all those cases.

The proposal appears to be to migrate output-based tests (using ever-more-complicated FileCheck features) to executable tests, which makes it more like the software development people are used to instead of test-writing.  But I don’t see that necessarily solving the problem; seems like it would be just as easy to write a program that doesn’t do the right thing as to write a FileCheck test that doesn’t do the right thing.

IMHO, having a tool that makes it easier to express what is actually intended to be tested is already worth a lot. For instance, we usually don't care about SSA value names or MDNode numbers, but we have to put extra work into regex-ing them away in FileCheck tests, and as a result most of the tests we have still expect the exact number of a metadata node. This is a problem whenever we want to emit new metadata nodes, because all of those tests then have to be updated.
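
As a made-up illustration (this IR line is invented, not from a real test), hard-coding the numbers is the easy thing to write:

    ; CHECK: %3 = load i32, i32* %1, align 4, !llvm.access.group !7

but it breaks as soon as any earlier value or metadata node is added or removed. Making the same check robust means regex-ing every name away:

    ; CHECK: %[[LD:[0-9]+]] = load i32, i32* %[[PTR:[0-9]+]], align 4, !llvm.access.group ![[AG:[0-9]+]]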

This problem would go away if the test method ignored value names and MDNode numbers by default, and developers had to put in extra work only when they actually want to verify them.



Hal’s suggestion is more to the point:  If the output we’re generating is not appropriate to the kinds of tests we want to perform, it can be worthwhile to generate different kinds of output.  MIR is a case in point; for a long time it was hard to introspect into the interval between IR and final machine code, but now it’s a lot easier.

Can you elaborate on what makes it easier?


Michael
