[PATCH] D86671: [clang-tidy] Add new case type to check variables with Hungarian notation

Thu Sep 24 10:06:30 PDT 2020

aaron.ballman added a comment.

In D86671#2284078 <https://reviews.llvm.org/D86671#2284078>, @dougpuob wrote:

> Hi @aaron.ballman
> About changing  `size_t nLength` to `cbLength`. I searched MSVC folders with `size_t`, many names of variable start with `n`, or `i` in MFC related files. So I prefer to keep it starts with `n`. Another side to name starts with `cb`, I found variables like cbXxx are defined with ULONG, DWORD, or UCHAR type.

I think the important point is that `cb` is used for APIs that specify a count of bytes, whereas `i`, `n`, and `l` are used for integers (there is no specific prefix for `size_t` that I'm aware of or that MSDN documents, so the lack of consistency there is not too surprising). That's why I mentioned that using a heuristic approach may allow you to identify the parameters that are intended to be a count of bytes so that you can use the more specific prefix if there's more specific contextual information available.

================
Comment at: clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp:236
+      PrefixStr = "fn"; // Function Pointer
+    } else if (QT->isPointerType()) {
+      // clang-format off
----------------
dougpuob wrote:
> aaron.ballman wrote:
> > I'm not certain how valid it is to look at just the type and decide that it's a null-terminated string. For instance, the following is not an uncommon pattern: `void something(const char *buffer, size_t length);` and it would be a bit strange to change that into `szBuffer` as that would give an indication that the buffer is null terminated. You could look at surrounding code for additional information though, like other parameters in a function declaration. As an aside, this sort of heuristic search may also allow you to change `length` into `cbLength` instead of `nLength` for conventions like the Microsoft one.
> > 
> > However, for local variables, I don't see how you could even come up with a heuristic like you could with parameters, so I'm not certain what the right answer is here.
> OK (size_t nLength --> cbLength)
> 
> About the `void something(const char *buffer, size_t length)` pattern. `char` is a special type which can express signed char and unsigned char.  One can express C string(ASCII), another expresses general data buffer(in hex). I think you are mentioning when it is a general data buffer. I agree with you if it changed the name from `buffer` to `szBuffer` for general data buffer is strange. I prefer to use `uint8_t` or `unsigned char` instead.
> 
> How about adding a new config option for it? like, `  - { key: readability-identifier-naming.HungarainNotation.DontChangeCharPointer   , value: 1 }` Users can make decision for their projects. For consistency with HN, the default will still be changed to `szBuffer` in your case.
> 
> If I add the option, does it make sense to you from your side?
I'm not certain that approach would make sense because the decision needs to be made on a case-by-case basis. Taking the C standard library as an example set of APIs, a function like `char *strncpy(char * restrict s1, const char * restrict s2, size_t n);` is one where this functionality should not change `s1` or `s2` to be `szS1` or `szS2` because the `sz` modifier indicates a null-terminated string, but these strings do not have to be null terminated. However, an API like `char *strncat(char * restrict s1, const char * restrict s2, size_t n);` should change `s1` to be `szS1` because that string is required to be null terminated. In both cases, the declarations use `char`.

`uint8_t` and friends are newer datatypes (added in C99) and are not required to be present (so they're not portable to all implementations), which may account for buffer-based APIs still using `char`.

I think adding some heuristics to the check may help some of these situations pick the correct prefix, but I'm worried that cases like the two above are going to be almost intractable for the check. I don't have a good idea of how prevalent that kind of issue will be, though. One thing that would help me feel more comfortable would be to look at some data about how the check performs on some various code bases. e.g., pick some large or popular libraries that use C, C++, or both (Win32 SDK, LLVM headers, OpenSSL, etc), run the check over it, and check the results to see how often the prefix is right compared to how often it's wrong. If we find it's mostly right and only gets it wrong a small percentage of the time, then we've likely found a good compromise for the check's behavior. If we find it gets it wrong too often, then the kinds of tweaks needed may be more obvious.

Btw, the reason I think it's important to be careful about false positives for this particular check is because Hungarian notation is intended to convey semantic information in some cases (like whether a buffer is null terminated or not) and that can have security implications. This is the sort of check where it's easy for a person to change all the identifiers in their program assuming they're correct and only find out about the issue much later when someone else gets the semantics wrong because of an incorrect prefix.

================
Comment at: clang-tools-extra/clang-tidy/readability/IdentifierNamingCheck.cpp:319
+std::string
+IdentifierNamingCheck::getDeclTypeName(const clang::NamedDecl *Decl) const {
+  const ValueDecl *ValDecl = dyn_cast<ValueDecl>(Decl);
----------------
dougpuob wrote:
> aaron.ballman wrote:
> > `ND` instead of `Decl`.
> > 
> > The function name doesn't really help me to understand why you'd call this as opposed to getting the type information as a string from the `NamedDecl` itself. I'm a bit worried about maintaining this code as the language evolves -- Clang will get new keywords, and someone will have to remember to come update this code. Could you get away with using `Decl->getType()->getAsString()` and working with that rather than going back to the original source code and trying to parse manually?
> OK, I should do it. (ND instead of Decl.)
> 
> Use `Decl->getType()->getAsString()` is a good way. But HN is a strongly-typed notation, if users haven't specific the include directories, the checking result may look messy (it will be changed to unexpected type). So I parse a string with `SourceLocation`, just for user-friendly.
> 
> I understand your consideration, maintaining-friendly is also important. I can implement it with `Decl->getType()->getAsString()`, if my explanation can't convince you still. It is also fine to me. Or you think it just need a better appropriate function name?
> But HN is a strongly-typed notation, if users haven't specific the include directories, the checking result may look messy (it will be changed to unexpected type). 

I'm not certain I understand what issue you're worried about here. Do you have a code example that might help clarify?

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D86671/new/

https://reviews.llvm.org/D86671