[llvm-dev] [RFC] Updating googletest to non-release tagged version

Tue Mar 20 13:30:45 PDT 2018

>> In my particular case from https://reviews.llvm.org/D44560, I currently
>> test the following 3 different cases across the full set of DWARF
>> versions and formats:
>>  - Parsing a valid line table
>>  - Emitting an error if the stated prologue length is greater than the
>>    actual length
>>  - Emitting an error if the stated prologue length is shorter than the
>>    actual length
>> The first is just testing the positive cases for each possible input. I
>> guess a single test for DWARF64 could be written for versions 2-4 and
>> another for version 5 (where there is more stuff between the two length
>> fields so this becomes interesting), similarly for DWARF32. To summarise,
>> I think that these cases are interesting: V5 + 32, V5 + 64, V2 + 32/64,
>> V3 + 32/64, V4 + 32/64. The biggest issue I have with cutting down to
>> just this set is that it makes the tests less specific, e.g. at a glance,
>> what is important about the test - the fact that it is v4, or DWARF64,
>> or both independently, or the combination?
>>
>> Related aside: I've realised since earlier that there is scope for
>> version 2 tests, distinct from version 3: v2 tests test the lower
>> boundary on valid versions, and v3 the upper boundary on versions
>> without maximum_operations_per_instruction.
>>
>> The latter two test cases are important because a) the length field has
>> a different size for DWARF32/64 and therefore the prologue length needs
>> to be measured from a different point between the different formats, and
>> b) the contents of the prologue are different in each of version 3, 4,
>> and 5, and thus the amount read will be different. We could test each
												|
>> individual version, independently of the format, but it is theoretically
>> possible for an error to sneak in whereby the two different failure
>> points cancel each other out. The benefit is admittedly small, but these
>> tests are fast, so I don't think it hurts to have them.
>
> Still not sure I follow all of this - though perhaps the test design
> discussion might be better in its own thread. But the broad subject/topic/
> name of the stuff I'm interested in applying here is equivalence
> partitioning: https://en.wikipedia.org/wiki/Equivalence_partitioning -
> helps explain the general idea of reducing "test all combinations/
> permutations of cases" to "test representative cases from each class".
>
> - Dave

Equivalence classes are a fine way to reduce duplication in manually
written tests.  They tend to depend on a white-box understanding of the
system-under-test to define the equivalences reasonably.  LLVM regression
tests tend to be written very much white-box, as it's the developer who
is writing the test. :-)  The problem there is that as the code evolves,
the original equivalence classes may change, but that doesn't necessarily
make any tests fail, it just ends up with reduced coverage.  Without good
coverage analysis we can easily find tests no longer verifying everything
we wanted them to.

(And in fact back at HP I had an interesting chat with a product owner
who was very much a fan of combinatoric testing; he said every single
time he added a new dimension, he found new bugs, and didn't think he
could have predicted which instances would have found the bugs.)

That said, in the case at hand I think looking at the testing at the
granularity of "parsing a line table" is too coarse.  There are several
different bits to it and we can reasonably anticipate that they will
remain relatively independent bits:
- detecting the 32/64 format
- extracting the version-dependent sets of fields accurately
- detecting mismatches between the stated length and the internally
  consistent length (e.g., the header says length X, but the file
  table has N entries which runs it to length > X).

To the extent that new DWARF versions have different fields, we should
test that the new fields are extracted and reported (the success cases).
If the new fields are 32/64-sensitive, those should be tested in both 
formats to make sure that sensitivity is handled (more success cases).
The notion of running past the stated length (conversely, finishing the
parsing before reaching the stated length) intuitively isn't either
version-sensitive or format-sensitive, i.e. all those cases look like
they should be part of the same equivalence class, so you don't need
to replicate those across all versions and formats.  With the caveat
that if white-box examination shows that some case takes a very
different path, it's worth replicating the test.  In particular the
v5 directory/file tables are very different from prior versions, so
it's worth having separate bad-length tests for pre-v5 and v5.

--paulr