[PATCH] D122914: [Windows] Fix handling of \" in program name on cmd line.

Simon Tatham via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Mon Apr 4 01:28:06 PDT 2022


simon_tatham added a comment.

In D122914#3423312 <https://reviews.llvm.org/D122914#3423312>, @rnk wrote:

> We tried to write this code exactly following the rules the CRT uses to produce the `argv[]` array supplied to `main`. How does the CRT startup code handle that? Does it have special handling for the first token to match CreateProcess here?

I haven't seen the source of the real Windows CRT startup code, I'm afraid. But observing it as a black box, yes, it really does seem to handle the first token specially. I compiled and ran this trivial test program with VS2019, which prints both the `GetCommandLine` raw string and the `argv` received by `main`:

  #include <stdio.h>
  #include <windows.h>
  int main(int argc, char **argv) {
    for (int i = 0; i < argc; i++) printf("argv[%d] = [%s]\n", i, argv[i]);
    printf("cl = [%s]\n", GetCommandLineA());
  }

Output with a test string consisting of the exact same syntax twice:

  C:\Users\statham>".\"args ".\"args
  argv[0] = [.\args]
  argv[1] = [."args]
  cl = [".\"args  ".\"args]

You can see that `GetCommandLine` has received the same string I passed to `cmd.exe`, without any pre-transformation by `cmd` or `CreateProcess` to make it more palatable to the CRT (which was another hypothesis I considered). But then the C library argv splitter has treated the two copies of the same string differently: in `argv[0]`, the quotes have been discarded and the backslash kept, whereas in `argv[1]` the backslash has escaped the quote.

(And therefore you'd expect this command line to have ended in mid-quoted string – and running again with further text after the second copy of `".\"args`, it's easy to see that that is indeed the case ...

  C:\Users\statham>".\"args ".\"args one " two
  argv[0] = [.\args]
  argv[1] = [."args one ]
  argv[2] = [two]
  cl = [".\"args  ".\"args one " two]

... which supports your other comment that we ought not to be throwing away non-empty unclosed quoted arguments at the end of the command line.)

> Is the idea that Windows paths are not allowed to contain quotes, so backslash is not allowed to have any escaping power? In other words, double quotes in the first token are treated similarly to bash single quote, they have no escape mechanism.

As I say, I don't know for 100% sure how the CRT really thinks. But what you say is exactly my interpretation (including the rationale about `"` not being a legal filename character).

> I would really like to avoid adding another tokenization variant function.

I hear you. But the problem is that the command-line tokenizer in llvm/Support is used both for full command lines as received from `GetCommandLine` //and// for snippets in things like response files, which only contain arguments intended to come after the command name. And surely one of those needs this special-casing of the first token, and the other one does not, if they're to be parsed correctly. So I don't see how to avoid having two functions, or one function with a boolean flag, or //some// way to trigger two different kinds of parsing for two contexts.



================
Comment at: llvm/lib/Support/CommandLine.cpp:1027-1028
 
   if (State == UNQUOTED)
     AddToken(Saver.save(Token.str()));
 }
----------------
rnk wrote:
> This seems like it could be a bug when the command line ends with an open quoted string. We should match the CRT logic. In the original example, the entire command line would become argv[0].
I thought the same, and I considered fixing it as part of the same commit – I would have expected that we'd add the current token if it had accumulated any text, whether the quotes had been closed or not. But the git history tells me that D78346 changed it from that to this, so I didn't want to unilaterally revert a previous deliberate change.

However, now I look more closely at that previous change, it looks as if it's not intentionally discarding any actual text – it's _preventing_ the discarding of an explicit empty-string token. So perhaps I can fix that after all without reverting the intended change. I think we ought to be testing `State != INIT`.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D122914/new/

https://reviews.llvm.org/D122914



More information about the llvm-commits mailing list