[PATCH] D30691: [analyzer] Support for naive cross translational unit analysis

Wed Nov 29 03:42:27 PST 2017

gerazo added a comment.

Thanks George for the review. I will start working on the code right away. I've tried to answer the simpler cases.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:44

+CTU_FUNCTION_MAP_FILENAME = 'externalFnMap.txt'
+CTU_TEMP_FNMAP_FOLDER = 'tmpExternalFnMaps'
----------------
george.karpenkov wrote:
> What would happen when multiple analyzer runs are launched concurrently in the same directory? Would they race on this file?
Yes and no. The 1st, collect part of CTU creates this file by aggregating all data from the build system, while the 2nd part which does the analysis itself only reads it. Multiple analysis can use this file simultaneously without problem. However, multiple collect phases launched on a system does not make sense. In this case, the later one would write over the previous results with the same data.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:64
             if need_analyzer(args.build):
-                run_analyzer_parallel(args)
+                run_analyzer_with_ctu(args)
         else:
----------------
george.karpenkov wrote:
> `run_analyzer_with_ctu` is an unfortunate function name, since it may or may not launch CTU depending on the passed arguments. Could we just call it `run_analyzer`?
We have an other run_analyzer method but still, it is a good idead, I will make something better up.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:103

+def prefix_with(constant, pieces):
+    """ From a sequence create another sequence where every second element
----------------
george.karpenkov wrote:
> Actually I would prefer a separate NFC PR just moving this function. This PR is already too complicated as it is.
Yes. The only reason was for the move to make it testable. However, we need more tests as you wrote down below.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:120
+                  func_map_cmd=args.func_map_cmd)
+        if hasattr(args, 'ctu_phases') and hasattr(args.ctu_phases, 'dir')
+        else CtuConfig(collect=False, analyze=False, dir='', func_map_cmd=''))
----------------
george.karpenkov wrote:
> Extensive `hasattr` usage is often a codesmell, and it hinders understanding.
> Firstly, shouldn't  `args.ctu_phases.dir` be always available, judging from the code in `arguments.py`?
> Secondly, in what cases `args.ctu_phases` is not available when we are already going down the CTU code path? Shouldn't we throw an error at that point instead of creating a default configuration? (default-configurations-instead-of-crashing is also another way to introduce very subtle and hard-to-catch error modes)
It definitely needs more comments, so thanks, I will put them here.
There are two separate possibilities here for this code part:
1. The user does not want CTU at all. In this case, no CTU phases are switched on (collect and analyze), so no CTU code will run. This is why dir has no importance in this case.
2. CTU capabilities were not even detected, so the help and available arguments about CTU are also missing. In this case we create a dummy config telling that everything is switched off.

The reason for using hasattr was that not having CTU capabilities is not an exceptional condition, rather a matter of configuration. I felt that handling this with an exception would be misleading.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:128
+    A function map contains the id of a function (mangled name) and the
+    originating source (the corresponding AST file) name."""
+
----------------
george.karpenkov wrote:
> Could you also specify what `func_map_lines` is (file? list? of strings?) in the docstring? There's specific Sphinx syntax for that. Otherwise it is hard to follow.
Thanks, I'll do it.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:144
+        if len(ast_files) == 1:
+            mangled_ast_pairs.append((mangled_name, ast_files.pop()))
+
----------------
george.karpenkov wrote:
> Firstly, no need to modify the set in order to get the first element, just use `next(iter(ast_files))`.
> Secondly, when exactly do conflicting names happen? Is it a bug? Shouldn't the tool log such cases?
Nice catch, thanks.
For your second question: Unfortunately, it is not a bug, rather a misfeature which we can't handle yet. There can be tricky build-systems where a single file with different configurations is built multiple times, so the same function signature will be present in multiple link units. The other option is when two separate compile units have an exact same function signature, but they are never linked together, so it is not a problem in the build itself. In this case, the cross compilation unit functionality cannot tell exactly which compilation unit to turn to for the function, because there are multiple valid choices. In this case, we deliberately leave such functions out to avoid potential problems. It deserves a comment I think.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:160
+    def generate_func_map_lines(fnmap_dir):
+        """ Iterate over all lines of input files in random order. """
+
----------------
george.karpenkov wrote:
> Firstly, is `glob.glob` actually random, as the docstring is saying?
> If yes, can we make it not to be random (e.g. with `sorted`), as dealing with randomized input makes tracking down bugs so much harder?
"random" here means we don't care. So you are right. glob.glob doc says it is arbitrary but I couldn't find that it is deterministic or not. Probably sorting it will not do harm. 

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:189
+            # Remove all temporary files
+            shutil.rmtree(fnmap_dir, ignore_errors=True)
+
----------------
george.karpenkov wrote:
> Having an analysis tool remove files is scary, what if (maybe not in this revision, but in a future iteration) a bug is introduced, and the tool removes user code instead?
> Why not just create a temporary directory with `tempfile.mkdtemp`, put all temporary files there, and then simply iterate through them?
> Then you would be able to get rid of the constant `CPU_TEMP_FNMAP_FOLDER` entirely, and OS would be responsible for cleanup.
Yes, you are right. We are essentially using a temp dir. Because of the size we first had to put it next to the project (not on tmp drive for instance) and for debugging purposes we gave a name to it. Still it can be done with mkdtemp as well.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:230
+    if ctu_config.collect:
+        shutil.rmtree(ctu_config.dir, ignore_errors=True)
+    if ctu_config.collect and ctu_config.analyze:
----------------
george.karpenkov wrote:
> Similarly to the comment above, I would prefer if analysis tool would not remove files (and I assume those are not huge ones?)
> Can we just use temporary directories?
Unlike above, here we do remove non-temporary data intentionally. The user asks here to do the recollection of CTU data for a fresh start. Because there is no "clean" functionality in the analyzer interface itself, this seemed to be the easiest-on-user solution to save him/her an extra effort or a new command.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:262
+        'ANALYZE_BUILD_FORCE_DEBUG': 'yes' if args.force_debug else '',
+        'ANALYZE_BUILD_CTU': json.dumps(get_ctu_config(args))
     })
----------------
george.karpenkov wrote:
> Using JSON-serialization-over-environment-variables is very unorthodox, and would give very bizarre error messages if the user would try to customize settings (I assume that is the intent, right?)
> I think that separating those options into separate ENV variables (if they _have_ to be customizable) would be much better.
To be honest, I could never make a decision over this. The previous version of scan-build-py used extensively environment variables for everything which ended up in a huge mess and env contamination. There was an effort to reduce this to only a few, well-designated ones. Still there is the need of adding new data to the environment.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:561
+            except OSError:
+                pass
+        ast_command = [opts['clang'], '-emit-ast']
----------------
george.karpenkov wrote:
> `try/except/pass` is almost always bad.
> When can the error occur? Why are we ignoring it?
I think this code is redundant with the if above.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:609
+    # Recover namedtuple from json when coming from analyze_cc
+    if not hasattr(ctu_config, 'collect'):
+        ctu_config = CtuConfig(collect=ctu_config[0],
----------------
george.karpenkov wrote:
> In which case is this branch hit? Isn't improperly formed input argument indicative of an internal error at this stage?
An other part of scan-build-py, analyze_cc uses namedtuple to json format to communicate. However, the names are not coming back from json, so this code helps in this. This is the case when someone uses the whole toolset with compiler wrapping. All the environment variable hassle is also happening because of this. So these env vars are not for user modification (as you've suggested earlier).

================
Comment at: tools/scan-build-py/libscanbuild/clang.py:168
+        return False
+    return True
+
----------------
george.karpenkov wrote:
> I might be missing something here, but why is the ability to call `--version` indicative of CTU support?
> At worst, this can lead to obscuring real bugs: imagine if the user has `args.clang` pointed to broken/non-existent binary, then `is_ctu_capable` would simply return `False` (hiding the original error!), which would show a completely misleading error message.
> 
> Just checking `func_map_cmd` seems better, but even in this case we should probably log any errors occurring on `-version` call (such messages would really aid debugging)
The original idea was that clang can give information about CTU support itself. However, it never happened because the analyzer is so deep down in the system. So I am open to remove the clang binary check here. However, clang binary is needed anyway, so the whole toolset will still throw an error later on not having a clang binary.

https://reviews.llvm.org/D30691