[PATCH] Teach Clang how to use response files when calling other tools

Mon Aug 25 19:56:10 PDT 2014

Hi Sean,

Thanks a lot for your careful review. I have answered your questions below, and I have few concerns. Please let me know if I am missing something:

This code is on the critical path for commands with very large lengths and this is a non-negligible time. For UNIX, for example, this would be triggered for commands spanning more than 2MB of arguments and Clang spends a good amount of time to parse even trivial arguments, because it is a huge number of arguments. 

Another important point is that, currently, the Command class stores arguments as a list of char *.

This is the reason for this design with lots of naked char *. It avoids creating StringRefs, that would measure the length of char* in their constructor and cause more overhead. It also avoids using high-level classes such as streams and bump allocators because, since this is likely to be called only when the number of arguments is very large, would cause them to repeatedly perform heap allocations until they finally settle to support the final size of the string. Either that or we would need to work with very large slabs, wasting memory. In this design, the length is calculated once and a single allocation is made, similar to what is done in Execute() in Support/Windows/Program.inc.

Therefore, by calculating the size of the final command line beforehand, even though it looks like we are losing by iterating over all args twice (one for length calculation and other for parsing), we win by eliminating inefficiencies in the allocation mechanism in runtime or in memory consumption.

Regards,
Rafael

================
Comment at: lib/Driver/Job.cpp:122
@@ +121,3 @@
+/// will escape the regular backslashes (used in Windows paths) and quotes.
+static std::unique_ptr<char[]> FlattenArgs(const char **args) {
+  // First, determine the length of the command line.
----------------
silvas wrote:
> This would be a lot nicer with ArrayRef<const char *>; that way, you won't lose the length.
That's a good point. However, I am afraid that I would still need to recalculate the length by traversing each argument, calculating its length and assessing whether it needs quoting or not.

================
Comment at: lib/Driver/Job.cpp:169
@@ +168,3 @@
+  llvm::SmallVector<const char *, 128> NewArgv;
+  unsigned len = 0;
+  for (auto Input : Inputs) {
----------------
silvas wrote:
> Check your naming here and in other places.
Thanks for pointing out, I fixed the naming. Will submit the patch soon.

================
Comment at: lib/Driver/Tools.h:546-550
@@ +545,7 @@
+    llvm::sys::EncodingMethod getResponseFileEncoding() const override {
+#if defined(LLVM_ON_WIN32)
+      return llvm::sys::EM_CurrentCodePage;
+#else
+      return llvm::sys::EM_UTF8;
+#endif
+    }
----------------
silvas wrote:
> Ew... let's not have a bunch of random #ifdef's all over the code outside libSupport.
I agree that this is inadequate. My previous patch version delegated this reasoning to libSupport. However, it was more difficult to understand and I changed it for clarity. Now, it is obvious that the response file encoding for this tool depends on whether it is running on Windows or not. In my previous libSupport version, I had enum members that meant "this is UTF on UNIX but UTF16 on Windows", and so on. I think this is not a good design because if you need to support a new tool, you may potentially need to add a new enum member in libSupport to encode your tuple <MyUNIXEncoding, MyWindowsEncoding>.

Since the encoding depends on the tool and on the operating system, I would appreciate if you have any suggestion on how to code this. Meanwhile, I will think in something.

http://reviews.llvm.org/D4897