[PATCH] D30691: [analyzer] Support for naive cross translational unit analysis

Wed Dec 6 14:03:12 PST 2017

george.karpenkov added inline comments.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:44

+CTU_FUNCTION_MAP_FILENAME = 'externalFnMap.txt'
+CTU_TEMP_FNMAP_FOLDER = 'tmpExternalFnMaps'
----------------
gerazo wrote:
> george.karpenkov wrote:
> > What would happen when multiple analyzer runs are launched concurrently in the same directory? Would they race on this file?
> Yes and no. The 1st, collect part of CTU creates this file by aggregating all data from the build system, while the 2nd part which does the analysis itself only reads it. Multiple analysis can use this file simultaneously without problem. However, multiple collect phases launched on a system does not make sense. In this case, the later one would write over the previous results with the same data.
> However, multiple collect phases launched on a system does not make sense

Why not? What about a system with multiple users, where code is in the shared directory? Or even simpler: I've launched a background process, forgot about it, and then launched it again?

> In this case, the later one would write over the previous results with the same data.

That is probably fine, I am more worried about racing, where process B would be reading a partially overriden file (not completely sure whether it is possible)

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:103

+def prefix_with(constant, pieces):
+    """ From a sequence create another sequence where every second element
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Actually I would prefer a separate NFC PR just moving this function. This PR is already too complicated as it is.
> Yes. The only reason was for the move to make it testable. However, we need more tests as you wrote down below.
Sure, but that separate PR can also include tests.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:120
+                  func_map_cmd=args.func_map_cmd)
+        if hasattr(args, 'ctu_phases') and hasattr(args.ctu_phases, 'dir')
+        else CtuConfig(collect=False, analyze=False, dir='', func_map_cmd=''))
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Extensive `hasattr` usage is often a codesmell, and it hinders understanding.
> > Firstly, shouldn't  `args.ctu_phases.dir` be always available, judging from the code in `arguments.py`?
> > Secondly, in what cases `args.ctu_phases` is not available when we are already going down the CTU code path? Shouldn't we throw an error at that point instead of creating a default configuration? (default-configurations-instead-of-crashing is also another way to introduce very subtle and hard-to-catch error modes)
> It definitely needs more comments, so thanks, I will put them here.
> There are two separate possibilities here for this code part:
> 1. The user does not want CTU at all. In this case, no CTU phases are switched on (collect and analyze), so no CTU code will run. This is why dir has no importance in this case.
> 2. CTU capabilities were not even detected, so the help and available arguments about CTU are also missing. In this case we create a dummy config telling that everything is switched off.
> 
> The reason for using hasattr was that not having CTU capabilities is not an exceptional condition, rather a matter of configuration. I felt that handling this with an exception would be misleading.
Right, but instead of doing (1) and (2) can we have a separate (maybe hidden from user) param called e.g. `ctu_enabled` which would explicitly communicate that?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:144
+        if len(ast_files) == 1:
+            mangled_ast_pairs.append((mangled_name, ast_files.pop()))
+
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Firstly, no need to modify the set in order to get the first element, just use `next(iter(ast_files))`.
> > Secondly, when exactly do conflicting names happen? Is it a bug? Shouldn't the tool log such cases?
> Nice catch, thanks.
> For your second question: Unfortunately, it is not a bug, rather a misfeature which we can't handle yet. There can be tricky build-systems where a single file with different configurations is built multiple times, so the same function signature will be present in multiple link units. The other option is when two separate compile units have an exact same function signature, but they are never linked together, so it is not a problem in the build itself. In this case, the cross compilation unit functionality cannot tell exactly which compilation unit to turn to for the function, because there are multiple valid choices. In this case, we deliberately leave such functions out to avoid potential problems. It deserves a comment I think.
> In this case, we deliberately leave such functions out to avoid potential problems

We probably want to log it, don't we?

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:145
+            mangled_ast_pairs.append((mangled_name, ast_files.pop()))
+
+    return mangled_ast_pairs
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Overall, instead of creating a dictionary with multiple elements, and then converting to a list, it's much simper to only add an element to `mangled_to_asts` when it is not already mapped to something (probably logging a message otherwise), and then just return `mangled_to_asts.items()`
> The reason for the previous is that we need to count the occurence number ofdifferent mappings only let those pass through which don't have multiple variations.
ah, OK!

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:189
+            # Remove all temporary files
+            shutil.rmtree(fnmap_dir, ignore_errors=True)
+
----------------
gerazo wrote:
> gerazo wrote:
> > george.karpenkov wrote:
> > > Having an analysis tool remove files is scary, what if (maybe not in this revision, but in a future iteration) a bug is introduced, and the tool removes user code instead?
> > > Why not just create a temporary directory with `tempfile.mkdtemp`, put all temporary files there, and then simply iterate through them?
> > > Then you would be able to get rid of the constant `CPU_TEMP_FNMAP_FOLDER` entirely, and OS would be responsible for cleanup.
> > Yes, you are right. We are essentially using a temp dir. Because of the size we first had to put it next to the project (not on tmp drive for instance) and for debugging purposes we gave a name to it. Still it can be done with mkdtemp as well.
> Finally, I came to the conclusion that mkdtemp would not be better than the current solution. In order to find our created dir by other threads, we need a designated name. Suffixing it by generated name would further complicate things as we need not to allow multiple concurrent runs here. The current solution is more robust from this point of view.
OK

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:230
+    if ctu_config.collect:
+        shutil.rmtree(ctu_config.dir, ignore_errors=True)
+    if ctu_config.collect and ctu_config.analyze:
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Similarly to the comment above, I would prefer if analysis tool would not remove files (and I assume those are not huge ones?)
> > Can we just use temporary directories?
> Unlike above, here we do remove non-temporary data intentionally. The user asks here to do the recollection of CTU data for a fresh start. Because there is no "clean" functionality in the analyzer interface itself, this seemed to be the easiest-on-user solution to save him/her an extra effort or a new command.
OK!

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:296
+        'command': [execution.cmd[0], '-c'] + compilation.flags,
+        'ctu': json.loads(os.getenv('ANALYZE_BUILD_CTU'))
     }
----------------
gerazo wrote:
> george.karpenkov wrote:
> > Again, is it possible to avoid JSON-over-environment-variables?
> There is an other thing against changing this. Currently the interface here using env variables is used by intercept-build, analyze-build and scan-build tool as well. In order to drop json, we need to change those tools too. It would be a separate patch definitely.
OK I didn't know that the JSON interface was used by other tools. In that case, ignore my comment.

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:567
+        ast_command.append('-o')
+        ast_command.append(ast_path)
+        logging.debug("Generating AST using '%s'", ast_command)
----------------
gerazo wrote:
> george.karpenkov wrote:
> > The above can be written more succinctly as:
> > `ast_command = [opts['clang'], ...] + args + ['-w', ...]`
> After several iterations of the code, I find it easier to version control such multiline constructs. If someone changes a data source, it is clear which one (which line) was modified. The succint notation does not allow clean VCS annotations.
OK. Though you could still use split addition across multiple lines with `\`

================
Comment at: tools/scan-build-py/libscanbuild/analyze.py:609
+    # Recover namedtuple from json when coming from analyze_cc
+    if not hasattr(ctu_config, 'collect'):
+        ctu_config = CtuConfig(collect=ctu_config[0],
----------------
gerazo wrote:
> george.karpenkov wrote:
> > In which case is this branch hit? Isn't improperly formed input argument indicative of an internal error at this stage?
> An other part of scan-build-py, analyze_cc uses namedtuple to json format to communicate. However, the names are not coming back from json, so this code helps in this. This is the case when someone uses the whole toolset with compiler wrapping. All the environment variable hassle is also happening because of this. So these env vars are not for user modification (as you've suggested earlier).
OK so `opts['ctu']` is a tuple or a named tuple depending on how this function is entered? BTW could you point me to the `analyze_cc` entry point?

For the purpose of having more uniform code with less cases to care about, do you think we could just use ordinary tuples instead of constructing a named one, since we have to deconstruct an ordinary tuple in any case?

================
Comment at: tools/scan-build-py/libscanbuild/arguments.py:376
+            metavar='<ctu-dir>',
+            default='ctu-dir',
+            help="""Defines the temporary directory used between ctu
----------------
BTW can we also explicitly add `dest='ctu_dir'` here, as otherwise I was initially very confused as to where the variable is set.

================
Comment at: tools/scan-build-py/libscanbuild/clang.py:168
+        return False
+    return True
+
----------------
gerazo wrote:
> george.karpenkov wrote:
> > I might be missing something here, but why is the ability to call `--version` indicative of CTU support?
> > At worst, this can lead to obscuring real bugs: imagine if the user has `args.clang` pointed to broken/non-existent binary, then `is_ctu_capable` would simply return `False` (hiding the original error!), which would show a completely misleading error message.
> > 
> > Just checking `func_map_cmd` seems better, but even in this case we should probably log any errors occurring on `-version` call (such messages would really aid debugging)
> The original idea was that clang can give information about CTU support itself. However, it never happened because the analyzer is so deep down in the system. So I am open to remove the clang binary check here. However, clang binary is needed anyway, so the whole toolset will still throw an error later on not having a clang binary.
> so the whole toolset will still throw an error later on not having a clang binary.

Of course, but I think that would be easier to debug, and the error would mean that Clang is not available, not that CTU is not working.

https://reviews.llvm.org/D30691