[PATCH] D99406: RFC [lit] Detect if processes to execute are MSys based, apply custom quoting logic

Martin Storsjö via Phabricator via llvm-commits llvm-commits at lists.llvm.org
Thu Apr 8 14:22:31 PDT 2021


mstorsjo added inline comments.


================
Comment at: llvm/utils/lit/lit/TestRunner.py:235
+    5 "a\b\\c\\\\d"   a\b\c\\d       a\b\\c\\\\d        a\b\\c\\\\d
+    6 "a\"b"          a"b            a\b                a"b
+
----------------
rnk wrote:
> mstorsjo wrote:
> > mstorsjo wrote:
> > > rnk wrote:
> > > > mstorsjo wrote:
> > > > > rnk wrote:
> > > > > > Are you sure the msys-noglob case here is correct? That seems wrong...
> > > > > > 
> > > > > > I have always believed that the correct, safe way to quote anything on Windows is to:
> > > > > > - surround each argument in double quotes
> > > > > > - backslash escape every `\` and `"` character
> > > > > > 
> > > > > > Even though this escapes strictly more than is necessary, you usually get the same results. This is for example the algorithm used to produce clang crash reproducer scripts, which is nice because they usually run in both cmd and bash.
> > > > > > 
> > > > > > Can we file a bug against msys about this? I feel like the app really should see `a"b` in this case.
> > > > > I guess the big question is "what is correct"; it's clearly not what you'd expect to be interpreted given the regular win32 arg quoting rules, no. But then again, msys executables also live in a world with lots of other wild requirements with how things are supposed to behave when invoked from various contexts, so it might come from handling some other msys specific usecase of expanding arguments in a more unix-shell-y way. So not sure if it's intentionally this way, or just ends up this way while trying to fulfill all other requirements.
> > > > > 
> > > > > Your generic windows quoting algorithm is off a bit, as you're only supposed to backslash escape `"` and backslashes directly preceding a `"`, but other backslashes are supposed to stay single.
> > > > > 
> > > > > I guess I could file a bug... One problem is where to aim it, as msys is mainly a friendly fork/repackaging of cygwin. I dunno how much it differs from upstream cygwin (I'll need to get a fresh installation of that.) But I guess I could file a bug there, discuss what's expected and what's not, and if possible continue further upstream.
> > > > > 
> > > > > Secondly, which ones of the behaviours should we report as buggy? I guess we could report all of the combinations that do differ from the regular win32 interpretation. If all of them would end up fixed, we wouldn't need to care about anything else than proper win32 quoting. But even by fixing a couple of them, we could find one setup (either quoting just what we need, or quoting excessively, and with or without setting the noglob setting) where things work consistently enough for us to rely on.
> > > > Well, the generic algorithm is meant to work with a wide variety of Unix-y and Windows-y consumers, so it doubles all backslashes. It sounds like the only cases where it doesn't work are either case 6 or related to glob characters.
> > > > 
> > > > As for filing a bug, I'm really focused on case 6: if the command line uses double quotes and globbing is off, MSys should try to strictly follow the unquoting rules. Otherwise, what is the correct way to pass arguments with double quotes to an msys-noglob program?
> > > No, the generic algorithm (that we have today, inherited from the python implementation) doesn't double all backslashes.
> > > 
> > > This is the highly non-obvious part about windows argument parsing, please reread this bit slowly.
> > > 
> > > There's two concepts involved; quoting (to make sure arguments aren't split on spaces), and escaping (to disambiguate quotes/backslashes).
> > > 
> > > On Windows, when parsing a path, doubled backslashes are *only* narrowed down to a single backslash if they're followed by a quote. I.e. if I pass the argument `a\\b`, the called app gets `a\\b`, but if I pass the argument `a\\"b`, the app gets `a\"b`. So the traditionally safe approach of "let's escape anything that might be ambiguous" doesn't work here, we can escape (double) backslashes only if we're in that condition.
> > > 
> > > See the code below here in our python quoting implementation, lines 273, 277 and 283 (in the right hand view after applying this patch); first when we get a backslash we don't know if we can/must double it, so we buffer it (273-274), then when we hit a non-backslash char, we either output the buffered backslashes as-is (283) or doubled (276-277).
> > > 
> > > Now finally, this handling of backslash escaping is (in win32 apps) the same regardless of whether the current argument is enclosed in double quotes or not. On msys however, it's not; adding the double quotes trigger thinning/unescaping of too many cases of double quotes (case 5 vs 2 above).
> > > 
> > > Now back why I think I'll report all the cases, the msys implementation really is a minefield of various cases that differ from the win32 reference:
> > > 
> > > 1. You noticed that a plain `a"b`, which gets escaped into `a\"b`, doesn't get parsed back into `a"b` on msys, so you added `"` to the list of chars that require quoting. This was case 3 in the list above, and quoting makes it case 6 which works fine, so far. This is one bug in msys's parsing, but one which we could live with due to quoting.
> > > 2. I notice that `a[b\c` loses the single backslash on msys (case 1). My initial reaction is to add `\` to the list of characters that trigger quoting the whole string (which makes it case 4), like you did above. This fixes this particular case, but then breaks case 2 which is turned into case 5 above. (We have a lot of cases where there's a number of backslashes e.g. in regexes, where the arguments aren't quoted right now, and quoting them breaks with msys based tools.)
> > > 3. I consider adding `MSYS=noglob`, which fixes case 1, but breaks case 6.
> > > 
> > > I realize now I haven't tested only adding `[` to the list of chars that trigger quoting the string. That might work for now (I'll check it in a moment), but the underlying issue still remains; there doesn't seem to be a single way to consistently quote arguments so that they get parsed correctly for both proper win32 apps and msys based ones. I.e. that case would break if you'd have an argument consisting of both `[` and series of double backslashes (where they're supposed to be escaped for the regex, but msys unescapes one layer too much).
> > Extending the check for chars that require quoting the whole string to also include `[` actually fixes the clang-tidy case without breaking anything else. But the root cause still stands; if a string that contains double backslashes (like case 2) ends up quoted (into case 5), the MSYS tool breaks it by unescaping the backslashes, so it's a rather brittle setup.
> > On Windows, when parsing a path, doubled backslashes are *only* narrowed down to a single backslash if they're followed by a quote.
> 
> Oh yeah, you're right. This approach doesn't work. I was so confident I had to actually a CreateProcess / argv sample program to convince myself.
> 
> > No, the generic algorithm (that we have today, inherited from the python implementation) doesn't double all backslashes.
> 
> The algorithm I'm describing is what we use to make crash reports, not what's used here in lit:
> https://github.com/llvm/llvm-project/blob/4fcb25583c3ccbe10c4367d02086269e5fa0bb87/llvm/lib/Support/Program.cpp#L83
> 
> In any case, I only imagined that it worked. >_>
> 
> > Extending the check for chars that require quoting the whole string to also include [ actually fixes the clang-tidy case without breaking anything else.
> 
> I think that's my preference, then, for now.
> 
Oh, right. That quoting that doubles backslashes might end up working in practice if the backslashes are path separators, and interprets it correctly if the string is passed to a bash shell though.


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D99406/new/

https://reviews.llvm.org/D99406



More information about the llvm-commits mailing list