[PATCH] D141508: [utils][filecheck-lint] Add filecheck-lint, a linter for FileCheck directives.

Thu Jan 12 02:47:52 PST 2023

bchetioui added a comment.

Thanks for the thorough review, Sam, I think this is much better!

I have addressed your comments in my latest revision. Let me know if something is still amiss.

================
Comment at: llvm/utils/filecheck_lint/README.md:16
+filecheck_lint path/to/test/file/1 ... path/to/test/file/n
+# With a custom edit distance reporting threshold (default: 3)
+filecheck_lint --threshold 4 path/to/test/file/1 ... path/to/test/file/n
----------------
sammccall wrote:
> I don't see any reason users should care about customizing this - we should set a sensible default, and the set of directives doesn't vary wildly between tests.
> 
> Fine if you want to have it for testing or some external use, but I wouldn't document it - it's distracting.
I removed the option and removed this line from the documentation.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:8
+# ===----------------------------------------------------------------------===##
+"""A linter that detects potential typos in FileCheck directive names."""
+
----------------
sammccall wrote:
> there are a lot of fine-grained comments here about the implementation (e.g. what a class represents) but it's hard to get an overall picture of what this library does and how.
> 
> I think an example would go a long way here, somethng like:
> 
> ```
> Consider a broken test foo.cpp:
> 
> // RUN: clang -cc1 -ast-dump %s | FileCheck %s --check-prefix=NEW
> // RUN: clang -cc1 -ast-dump %s -std=c++98 | FileCheck %s --check-prefix=OLD
> auto x = 42;
> // NEWW: auto is a c++11 extension
> // ODL-NOT: auto is a c++11 extension
> 
> First we detect FileCheck directive prefixes by looking for --check-prefix flags.
> Here we get {CHECK, NEW, OLD}, so our directive names are
> {CHECK, NEW, OLD, CHECK-NOT, NEW-NOT, ...}.
> 
> Then we look for lines that look like directives: these are of the form FOO: at the beginning of a line or a comment.
> If any of these are a "near-miss" for a directive name, then we suspect this is a typo.
> ```
Thanks, that's indeed much better.
I added a slight variation of your comment there, hope that's OK.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:39
+
+  Args:
+    s1: a string
----------------
sammccall wrote:
> If there's nothing non-obvious to say about an arg (the type hints already say they're strings), don't say anything? Similarly the "Returns:" just echoes the first line.
> 
> (I'm not a python person, maybe there's some structural reason this comment needs to be this way, but other files seem mostly not to have them)
The structure was meant to satisfy an overly eager linter.
That is not a good reason, and I agree it is not particularly useful. Removed.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:83
+class Diagnostic:
+  """Stores information about typos and emit error diagnostics.
+
----------------
sammccall wrote:
> "emit" doesn't belong here I think.
> 
> Maybe more precisely: "Stores information about one typo and possibly its fix" (is the fix optional?)
> 
> I'm not sure the second paragraph or the attributes adds anything nonobvious (Findings isn't a concept that makes sense here)
Done. (The fix is not optional.)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:124
+  Finds all the strings that could be potential directives in a sequence of
+  strings. This assumes that lines containing the directive always start with
+  either whitespace, or the specified comment prefix. What constitutes a
----------------
sammccall wrote:
> We should also support the directive being at the start of a line-comment, even if the line is otherwise non-empty.
> 
> `rg "\s*[^ ]\s*// CHECK" clang/test` shows many examples e.g:
> ```
> ../clang/test/Refactor/LocalRename/NoSymbolSelectedError.cpp
> 4:class Baz { // CHECK: [[@LINE]]:1: error: there is no symbol at the given location         
> ```
Oops, in fact, this is a case of the documentation being out of date with regards to the code.
In fact, the very case that you are outlining is tested in `test_find_potential_directives_comment_prefix` and works properly.
Corrected the documentation to reflect that, thanks!

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:131
+    lines: a sequence of strings
+    directive_prefix: an optional prefix associated with directives, e.g. '//'
+      or ';'. If not provided, the function does not attempt to match any prefix
----------------
sammccall wrote:
> this seems overly complicated: why not just always accept all comment markers in any type of file?
> 
> Eyeballing `"(?:^|#|//|;)\s*([A-Z0-9-_]+)\s*:"`, I don't see false positives.
A comment marker is not strictly necessary for something to be parsed as a directive by FileCheck---meaning I'd like to keep the option for directive prefix to be optional. But if it is always optional, then I am afraid we may see a host of false positives, since anything preceding a colon would always be matched for every type of file.

I personally prefer it like this, but if you'd rather, I could change "directive_prefix" to a boolean "match_comment_marker". When enabled, I would match any comment marker. WDYT?

Apart from the comment markers, I think the regex you suggest will not yield false positives, but that it may allow more "false negatives" to pass through. E.g., in https://github.com/tensorflow/tensorflow/commit/48cacf049f3d6ed3f289ccc48ec50491b6d8d9a8, one of typos contained lowercase letters, and another contained a space instead of a dash. These would not be caught by the regex you suggest, so I would like to keep the regex as is. WDYT?

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:140
+  matches = []
+  # TODO(bchetioui): regexes can be further improved to capture typoed
+  # directives more holistically. In practice, they seem to perform well as is.
----------------
sammccall wrote:
> it's not really clear what's to be done here: improved how? how can someone tell when it's time to delete this TODO?
I removed this unclear TODO. One improvement I was thinking about (but that does not actually concern the regex directly) is that something like "// some inline comment CHCK:" gives us a match with "some inline comment CHCK:" right now. We should probably also match "inline comment CHCK:", "comment CHCK:", and "CHCK:", since FileCheck may match either of those.

While more correct, I haven't yet validated that this is fine wrt false positives, so I have chosen to omit this from the initial version of the code.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:143
+  if directive_prefix is None:
+    directive_pattern = re.compile(r'([\d\w\_][\s\d\w\-_]*):')
+  else:
----------------
sammccall wrote:
> I'd suggest restricting to uppercase as this is a ~universal convention and will avoid big classes of false positives, allowing to simplify elsewhere
ACK, but I don't think restricting to uppercase will yield better results, as explained above in the "directive_prefix" comment.

By the way, I have run the linter over e.g. all the .ll test files in the LLVM repo, and I do not recall matching false positives related to case. (Either way, the number of false positives has actually been surprisingly low.)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:148
+    )
+  for lineno, line in enumerate(lines, start=1):
+    match = re.search(directive_pattern, line)
----------------
sammccall wrote:
> up to you, but I'm not sure getting convenient access to the line number is worth doing everything linewise and looping.
> You could make FilePos's constructor compute the line/column, then that class would do something useful!
> 
> (Not that it matters, but I suspect it also performs better in practice to run the regex over a big buffer, and actually needing the line number is the very rare error case)
I think you are right re. performance, and this is cleaner. Made the changes are you suggest, thanks!

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:187
+# TODO(bchetioui): check that the argument is actually part of a CHECK command.
+def parse_additional_prefixes(lines: Sequence[str]) -> Set[str]:
+  """Parses custom prefixes defined in the list of strings provided.
----------------
sammccall wrote:
> This seems like it could be much simpler: just take the whole file, match against a regex, and always split the captured arg on `,` since a prefix will never have one.
> 
> ```
> def parse_custom_prefixes(code): 
>   for m in re.finditer(r'--check-prefix(?:es)?[ =](\S+)', code):
>     for prefix in m.group(1).split(','):
>       yield prefix
> ```
Indeed, that's much better. I modified the regex slightly to match also other types of valid flag-settings that occur.
(That was actually one of the TODOs above, should have just done it :-)).

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:292
+  potential_directives = find_potential_directives(lines, comment_prefix)
+  diagnostics = []
+
----------------
sammccall wrote:
> up to you, many of these functions could be slightly terser with `yield` rather than create/append/return an array
Thanks, I made `find_potential_directives` and `find_directive_typos` generators, as you suggest.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:312
+      continue
+    elif score <= threshold and best_match not in _ignore:
+      diagnostics.append(
----------------
sammccall wrote:
> this is a redundant second check for _ignore, you already added it to all_directives above
The purpose of this check is to not report on detected typos when the likely intended directive is one we would ignore anyway. 
This is particularly useful if someone uses very short custom CHECK prefixes (e.g. given a check prefix "DO", "ToDo" would be close enough to "DO" that it could be a typo, but is actually closer to "TODO". Reporting it would likely be a false positive).

I suggest we leave it as is, WDYT?

================
Comment at: llvm/utils/filecheck_lint/tests/filecheck_lint_test.py:1
+# ===----------------------------------------------------------------------===##
+#
----------------
sammccall wrote:
> this test isn't being run by anything.
> you can leave it out (most utils aren't tested, this is already a tool for improving tests, recursion needs to stop at some point!)
> 
> But thinking about it more, I think it's harmless enough and probably helps when developing it.
> There are a few other tools that use unittest.TestCase, AFAIK nothing runs those either.
If you think this is harmless enough, I would like to keep some tests next to the tool. (If not, I will keep them on my end for personal use, but that seems worse.)
They don't need to be run automatically, but I do find them helpful for development.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141508/new/

https://reviews.llvm.org/D141508