[PATCH] D30691: [analyzer] Support for naive cross translational unit analysis

Mon Nov 27 14:35:02 PST 2017

george.karpenkov requested changes to this revision.
george.karpenkov added inline comments.
This revision now requires changes to proceed.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:44

+CTU_FUNCTION_MAP_FILENAME = 'externalFnMap.txt'
+CTU_TEMP_FNMAP_FOLDER = 'tmpExternalFnMaps'
----------------
What would happen when multiple analyzer runs are launched concurrently in the same directory? Would they race on this file?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:64
             if need_analyzer(args.build):
-                run_analyzer_parallel(args)
+                run_analyzer_with_ctu(args)
         else:
----------------
`run_analyzer_with_ctu` is an unfortunate function name, since it may or may not launch CTU depending on the passed arguments. Could we just call it `run_analyzer`?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:103

+def prefix_with(constant, pieces):
+    """ From a sequence create another sequence where every second element
----------------
Actually I would prefer a separate NFC PR just moving this function. This PR is already too complicated as it is.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:120
+                  func_map_cmd=args.func_map_cmd)
+        if hasattr(args, 'ctu_phases') and hasattr(args.ctu_phases, 'dir')
+        else CtuConfig(collect=False, analyze=False, dir='', func_map_cmd=''))
----------------
Extensive `hasattr` usage is often a codesmell, and it hinders understanding.
Firstly, shouldn't  `args.ctu_phases.dir` be always available, judging from the code in `arguments.py`?
Secondly, in what cases `args.ctu_phases` is not available when we are already going down the CTU code path? Shouldn't we throw an error at that point instead of creating a default configuration? (default-configurations-instead-of-crashing is also another way to introduce very subtle and hard-to-catch error modes)

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:128
+    A function map contains the id of a function (mangled name) and the
+    originating source (the corresponding AST file) name."""
+
----------------
Could you also specify what `func_map_lines` is (file? list? of strings?) in the docstring? There's specific Sphinx syntax for that. Otherwise it is hard to follow.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:138
+        else:
+            mangled_to_asts[mangled_name].add(ast_file)
+
----------------
Could be improved with `defaultdict`:

```
mangled_to_asts = defaultdict(set)
...
   mangled_to_asts[mangled_name].add(ast_file)
```

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:144
+        if len(ast_files) == 1:
+            mangled_ast_pairs.append((mangled_name, ast_files.pop()))
+
----------------
Firstly, no need to modify the set in order to get the first element, just use `next(iter(ast_files))`.
Secondly, when exactly do conflicting names happen? Is it a bug? Shouldn't the tool log such cases?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:145
+            mangled_ast_pairs.append((mangled_name, ast_files.pop()))
+
+    return mangled_ast_pairs
----------------
Overall, instead of creating a dictionary with multiple elements, and then converting to a list, it's much simper to only add an element to `mangled_to_asts` when it is not already mapped to something (probably logging a message otherwise), and then just return `mangled_to_asts.items()`

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:160
+    def generate_func_map_lines(fnmap_dir):
+        """ Iterate over all lines of input files in random order. """
+
----------------
Firstly, is `glob.glob` actually random, as the docstring is saying?
If yes, can we make it not to be random (e.g. with `sorted`), as dealing with randomized input makes tracking down bugs so much harder?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:168
+
+    def write_global_map(ctudir, arch, mangled_ast_pairs):
+        """ Write (mangled function name, ast file) pairs into final file. """
----------------
If `write_global_map` is a closure, please use the fact that it would capture `ctudir` from the outer scope, and remove it from the argument list.
That would make it more obvious that it does not change between invocations.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:189
+            # Remove all temporary files
+            shutil.rmtree(fnmap_dir, ignore_errors=True)
+
----------------
Having an analysis tool remove files is scary, what if (maybe not in this revision, but in a future iteration) a bug is introduced, and the tool removes user code instead?
Why not just create a temporary directory with `tempfile.mkdtemp`, put all temporary files there, and then simply iterate through them?
Then you would be able to get rid of the constant `CPU_TEMP_FNMAP_FOLDER` entirely, and OS would be responsible for cleanup.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:230
+    if ctu_config.collect:
+        shutil.rmtree(ctu_config.dir, ignore_errors=True)
+    if ctu_config.collect and ctu_config.analyze:
----------------
Similarly to the comment above, I would prefer if analysis tool would not remove files (and I assume those are not huge ones?)
Can we just use temporary directories?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:241
+        run_analyzer_parallel(args)
+        shutil.rmtree(ctu_config.dir, ignore_errors=True)
+    else:
----------------
Same as the comment above about removing folders. Also it seems like there should be a way to remove redundancy in `if collect / remove tree` block repeated twice.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:262
+        'ANALYZE_BUILD_FORCE_DEBUG': 'yes' if args.force_debug else '',
+        'ANALYZE_BUILD_CTU': json.dumps(get_ctu_config(args))
     })
----------------
Using JSON-serialization-over-environment-variables is very unorthodox, and would give very bizarre error messages if the user would try to customize settings (I assume that is the intent, right?)
I think that separating those options into separate ENV variables (if they _have_ to be customizable) would be much better.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:296
+        'command': [execution.cmd[0], '-c'] + compilation.flags,
+        'ctu': json.loads(os.getenv('ANALYZE_BUILD_CTU'))
     }
----------------
Again, is it possible to avoid JSON-over-environment-variables?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:534
+        mangled_name = fn_src_txt[0:dpos]
+        path = fn_src_txt[dpos + 1:]
+        # Normalize path on windows as well
----------------
`mangled_name, path = fn_src_txt.split(" ", 1)` ?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:561
+            except OSError:
+                pass
+        ast_command = [opts['clang'], '-emit-ast']
----------------
`try/except/pass` is almost always bad.
When can the error occur? Why are we ignoring it?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:567
+        ast_command.append('-o')
+        ast_command.append(ast_path)
+        logging.debug("Generating AST using '%s'", ast_command)
----------------
The above can be written more succinctly as:
`ast_command = [opts['clang'], ...] + args + ['-w', ...]`

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:578
+        funcmap_command.append('--')
+        funcmap_command.extend(args)
+        logging.debug("Generating function map using '%s'", funcmap_command)
----------------
Similarly here, `funcmap_command` can be generated in one line using `+`

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:588
+            except OSError:
+                pass
+        if func_ast_list:
----------------
Again, why is this error ignored?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:609
+    # Recover namedtuple from json when coming from analyze_cc
+    if not hasattr(ctu_config, 'collect'):
+        ctu_config = CtuConfig(collect=ctu_config[0],
----------------
In which case is this branch hit? Isn't improperly formed input argument indicative of an internal error at this stage?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:702

+
 # To have good results from static analyzer certain compiler options shall be
----------------
This blank line should not be in this PR.

================
Comment at: tools/scan-build-py/libscanbuild/clang.py:168
+        return False
+    return True
+
----------------
I might be missing something here, but why is the ability to call `--version` indicative of CTU support?
At worst, this can lead to obscuring real bugs: imagine if the user has `args.clang` pointed to broken/non-existent binary, then `is_ctu_capable` would simply return `False` (hiding the original error!), which would show a completely misleading error message.

Just checking `func_map_cmd` seems better, but even in this case we should probably log any errors occurring on `-version` call (such messages would really aid debugging)

================
Comment at: tools/scan-build-py/libscanbuild/clang.py:177
+    arch = ""
+    i = 0
+    while i < len(cmd) and cmd[i] != "-triple":
----------------
Seconded, would prefer this rewritten using `separator = cmd.find('-triple')`

================
Comment at: tools/scan-build-py/libscanbuild/report.py:270

+    # Protect parsers from bogus report files coming from clang crashes
+    for filename in glob.iglob(os.path.join(output_dir, pattern)):
----------------
I understand the intent here, but it seems it should be handled at a different level: would it be hard to change Clang to only write the report file at the very end, when no crash should be encountered? Or make parsers not choke on empty fields?

================
Comment at: tools/scan-build-py/tests/unit/test_analyze.py:338
+
+class PrefixWithTest(unittest.TestCase):
+
----------------
Probably more tests are required for almost 400 lines of functional Python code in this PR.
Would it be hard to have a full LIT-style integration test? E.g. have a dummy script emulating Clang with a dummy directory structure, which would show how all pieces are meant to fit together?

https://reviews.llvm.org/D30691