[PATCH] D141508: [utils][filecheck-lint] Add filecheck-lint, a linter for FileCheck directives.

Thu Jan 12 03:45:25 PST 2023

sammccall added a comment.

Thanks! This all looks pretty good to me apart from the complexity around comment markers, detecting which comment markers to use for which language etc.
I'd really like if you could either:

- decide we don't care about comment markers, and drop them entirely
- decide we do care, and hardcode (any comment marker|start-of-line), dropping the optional-ness and language detection
- explain why it's important that this feature is sometimes on/sometimes off, we need to know exactly the per-language comment marker etc

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:131
+    lines: a sequence of strings
+    directive_prefix: an optional prefix associated with directives, e.g. '//'
+      or ';'. If not provided, the function does not attempt to match any prefix
----------------
bchetioui wrote:
> sammccall wrote:
> > this seems overly complicated: why not just always accept all comment markers in any type of file?
> > 
> > Eyeballing `"(?:^|#|//|;)\s*([A-Z0-9-_]+)\s*:"`, I don't see false positives.
> A comment marker is not strictly necessary for something to be parsed as a directive by FileCheck---meaning I'd like to keep the option for directive prefix to be optional. But if it is always optional, then I am afraid we may see a host of false positives, since anything preceding a colon would always be matched for every type of file.
> 
> I personally prefer it like this, but if you'd rather, I could change "directive_prefix" to a boolean "match_comment_marker". When enabled, I would match any comment marker. WDYT?
> 
> Apart from the comment markers, I think the regex you suggest will not yield false positives, but that it may allow more "false negatives" to pass through. E.g., in https://github.com/tensorflow/tensorflow/commit/48cacf049f3d6ed3f289ccc48ec50491b6d8d9a8, one of typos contained lowercase letters, and another contained a space instead of a dash. These would not be caught by the regex you suggest, so I would like to keep the regex as is. WDYT?
> A comment marker is not strictly necessary for something to be parsed as a directive by FileCheck

Sure, but you're using heuristics, and have to make some decisions. If **all** filters are loose (don't require caps, allow spaces, don't require it to be at the start of a line/comment) then you're going to have lots of false positives.

In the code the comment marker is not optional: you always pass one at the callsite. So the code handling `directive_prefix is None` is dead & I think we should delete it. Alternatively, if you want to allow directives to appear anywhere,  delete the case that handles `directive_prefix`. Unless it's really important, we don't need to define an API that lets you turn this on or off.

Sure, that's why we would have `^` in the alternation: it will match at the beginning of the line (I guess you have to set re.MULTILINE in python for this)

> Apart from the comment markers, I think the regex you suggest will not yield false positives, but that it may allow more "false negatives" to pass through

Fair enough. Allowing spaces *and* lowercase letters seems very permissive. Matching too many things may not be a problem per se, though it could lead to missing the right match (e.g. `blah CHEC: something` won't match "CHEC" because it matches "blah CHEC"). But the details are up to you.

> I would like to keep the regex as is.

Sure, the details of the regex don't bother me much as simplifying the code a bit - regardless of which characters you choose to match, 

================
Comment at: llvm/utils/filecheck_lint/filecheck_lint/filecheck_lint.py:312
+      continue
+    elif score <= threshold and best_match not in _ignore:
+      diagnostics.append(
----------------
bchetioui wrote:
> sammccall wrote:
> > this is a redundant second check for _ignore, you already added it to all_directives above
> The purpose of this check is to not report on detected typos when the likely intended directive is one we would ignore anyway. 
> This is particularly useful if someone uses very short custom CHECK prefixes (e.g. given a check prefix "DO", "ToDo" would be close enough to "DO" that it could be a typo, but is actually closer to "TODO". Reporting it would likely be a false positive).
> 
> I suggest we leave it as is, WDYT?
Ah, got it - sure.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D141508/new/

https://reviews.llvm.org/D141508