[PATCH] D141508: [utils][filecheck-lint] Add filecheck-lint, a linter for FileCheck directives.

Wed Jan 11 10:44:50 PST 2023

sammccall added a comment.

Sorry, hit enter too soon on previous round.

Generally this looks really useful, and finds real bugs!

Most of the comments below are of the form:

- the implementation seems overly complex and general, I think being short/simple/direct is more useful for maintenance than being "good python code", as this is not production code and will be read/edited by people who don't know much python
- for my taste there's too much boilerplate comments repeating the names of things, and not enough explaining how this piece of code is contributing to solving the probelm

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint.py:18
+  parser = argparse.ArgumentParser(description=__doc__)
+  parser.add_argument(
+      '--distance-threshold',
----------------
this isn't used anywhere - if there isn't a concrete need I'd suggest just making it a constant, then you can drop argparse and just iterate over argv?

(If you want to use a different config out of tree that's not unreasonable, but it's not clear to me why that would be the case)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:8
+# ===----------------------------------------------------------------------===##
+"""A linter that detects potential typos in FileCheck directive names."""
+
----------------
there are a lot of fine-grained comments here about the implementation (e.g. what a class represents) but it's hard to get an overall picture of what this library does and how.

I think an example would go a long way here, somethng like:

```
Consider a broken test foo.cpp:

// RUN: clang -cc1 -ast-dump %s | FileCheck %s --check-prefix=NEW
// RUN: clang -cc1 -ast-dump %s -std=c++98 | FileCheck %s --check-prefix=OLD
auto x = 42;
// NEWW: auto is a c++11 extension
// ODL-NOT: auto is a c++11 extension

First we detect FileCheck directive prefixes by looking for --check-prefix flags.
Here we get {CHECK, NEW, OLD}, so our directive names are
{CHECK, NEW, OLD, CHECK-NOT, NEW-NOT, ...}.

Then we look for lines that look like directives: these are of the form FOO: at the beginning of a line or a comment.
If any of these are a "near-miss" for a directive name, then we suspect this is a typo.
```

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:39
+
+  Args:
+    s1: a string
----------------
If there's nothing non-obvious to say about an arg (the type hints already say they're strings), don't say anything? Similarly the "Returns:" just echoes the first line.

(I'm not a python person, maybe there's some structural reason this comment needs to be this way, but other files seem mostly not to have them)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:62
+class FilePos:
+  """Stores the coordinates of a span on a single line within a file.
+
----------------
nit: FileRange? (both avoiding the abbreviation & the impression that it describes a point only)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:83
+class Diagnostic:
+  """Stores information about typos and emit error diagnostics.
+
----------------
"emit" doesn't belong here I think.

Maybe more precisely: "Stores information about one typo and possibly its fix" (is the fix optional?)

I'm not sure the second paragraph or the attributes adds anything nonobvious (Findings isn't a concept that makes sense here)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:113
+    return (
+        f'Found potentially misspelt directive "{self.typo}". Did you mean '
+        f'"{self.fix}"?'
----------------
nit: US english is much more common in LLVM, so probably "misspelled"

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:124
+  Finds all the strings that could be potential directives in a sequence of
+  strings. This assumes that lines containing the directive always start with
+  either whitespace, or the specified comment prefix. What constitutes a
----------------
We should also support the directive being at the start of a line-comment, even if the line is otherwise non-empty.

`rg "\s*[^ ]\s*// CHECK" clang/test` shows many examples e.g:
```
../clang/test/Refactor/LocalRename/NoSymbolSelectedError.cpp
4:class Baz { // CHECK: [[@LINE]]:1: error: there is no symbol at the given location         
```

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:131
+    lines: a sequence of strings
+    directive_prefix: an optional prefix associated with directives, e.g. '//'
+      or ';'. If not provided, the function does not attempt to match any prefix
----------------
this seems overly complicated: why not just always accept all comment markers in any type of file?

Eyeballing `"(?:^|#|//|;)\s*([A-Z0-9-_]+)\s*:"`, I don't see false positives.

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:140
+  matches = []
+  # TODO(bchetioui): regexes can be further improved to capture typoed
+  # directives more holistically. In practice, they seem to perform well as is.
----------------
it's not really clear what's to be done here: improved how? how can someone tell when it's time to delete this TODO?

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:143
+  if directive_prefix is None:
+    directive_pattern = re.compile(r'([\d\w\_][\s\d\w\-_]*):')
+  else:
----------------
I'd suggest restricting to uppercase as this is a ~universal convention and will avoid big classes of false positives, allowing to simplify elsewhere

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:148
+    )
+  for lineno, line in enumerate(lines, start=1):
+    match = re.search(directive_pattern, line)
----------------
up to you, but I'm not sure getting convenient access to the line number is worth doing everything linewise and looping.
You could make FilePos's constructor compute the line/column, then that class would do something useful!

(Not that it matters, but I suspect it also performs better in practice to run the regex over a big buffer, and actually needing the line number is the very rare error case)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:187
+# TODO(bchetioui): check that the argument is actually part of a CHECK command.
+def parse_additional_prefixes(lines: Sequence[str]) -> Set[str]:
+  """Parses custom prefixes defined in the list of strings provided.
----------------
This seems like it could be much simpler: just take the whole file, match against a regex, and always split the captured arg on `,` since a prefix will never have one.

```
def parse_custom_prefixes(code): 
  for m in re.finditer(r'--check-prefix(?:es)?[ =](\S+)', code):
    for prefix in m.group(1).split(','):
      yield prefix
```

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:188
+def parse_additional_prefixes(lines: Sequence[str]) -> Set[str]:
+  """Parses custom prefixes defined in the list of strings provided.
+
----------------
this comment doesn't really say anything beyond what the function signature already says.

Maybe an example? (in the code: .... the prefixes are ...)

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:259
+    lines = f.readlines()
+  return find_directive_typos_impl(lines, filepath, comment_prefix, threshold)
+
----------------
why have two functions here instead of one? can't we just have the caller pass in the file content?

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:268
+) -> Sequence[Diagnostic]:
+  """Underlying implementation for `find_directive_typos`."""
+  all_prefixes = _prefixes.union(parse_additional_prefixes(lines))
----------------
this just echoes the function name

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:276
+      + list(_ignore)
+      + list(all_prefixes)
+  )
----------------
nit: make _suffixes = {"", "-NOT", "-DAG"} etc instead?

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:292
+  potential_directives = find_potential_directives(lines, comment_prefix)
+  diagnostics = []
+
----------------
up to you, many of these functions could be slightly terser with `yield` rather than create/append/return an array

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:312
+      continue
+    elif score <= threshold and best_match not in _ignore:
+      diagnostics.append(
----------------
this is a redundant second check for _ignore, you already added it to all_directives above

================
Comment at: llvm/utils/filecheck_lint/tests/filecheck_lint_test.py:1
+# ===----------------------------------------------------------------------===##
+#
----------------
this test isn't being run by anything.
you can leave it out (most utils aren't tested, this is already a tool for improving tests, recursion needs to stop at some point!)

But thinking about it more, I think it's harmless enough and probably helps when developing it.
There are a few other tools that use unittest.TestCase, AFAIK nothing runs those either.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141508/new/

https://reviews.llvm.org/D141508