[PATCH] D112735: export unique symbol list with llvm-nm new option "--export-symbols"

Fri Feb 4 01:37:41 PST 2022

jhenderson requested changes to this revision.
jhenderson added a comment.
This revision now requires changes to proceed.

In D112735#3294424 <https://reviews.llvm.org/D112735#3294424>, @DiggerLin wrote:

> In D112735#3292985 <https://reviews.llvm.org/D112735#3292985>, @jhenderson wrote:
>
>> As mentioned a bit inline, I think you should do some of the refactoring in separate patches, to simplify the number of changes being added here.
>>
>> 1. Pull out the `isSymbolDefined` code as you've already done.
>> 2. Split out/update the symbol sorting code as you've done/as I've suggested in the inline comments in this patch.
>
>
>
>   I think it is too late to create a separate refactor patch now ,we already have done several review on it.  the code is already in the patch, I do not think it is a good idea  to spend time to create a separate refactor patch and rebase current patch now. 

It is never too late to split up a patch. I am not happy with this patch in its current state, plus commits should be kept as small as practical. That way, if there is a problem, it's easier to identify where the problem might be, plus it makes reviewing of individual bits of it significantly simpler. Remember, it .

>> I think you should also do a refactor whereby SymbolList is no longer a global variable - having it as a global variable is confusing, at best, and at worst risks cases where it is still populated when it shouldn't be.
>
>   for the export list, I think we need a global variable , otherwise we need a variable in main() , and then pass into each function.  The llvm-nm is c style code, I do not think we can do all the refactor in this patch.   if you have have good idea, please let me know, I will create  a separate refactor patch after current patch land.

My suggested design I already provided in my previous comment actively needs to avoid the global variable - the proposed `getSymbolsFromObject` function would return this list for use by the respective calling functions. Global variables make code re-use harder, and lead to temporal dependencies, both of which this patch is inflicted with, in part due to the global `SymbolList`. Doing the refactor up-front would significantly improve this patch.

>> Finally, I think having some printing happen before the symbols have been gathered may be a bad idea. At the moment, things like the input filename are printed quite early on, which isn't what we want in some cases. There's no reason this couldn't be deferred until the printing stage, so I'd look at a refactor that does this.
>>
>> At the moment, the regular behavior is essentially:
>>
>>   for each input file (and archive member):
>>       get all symbols
>>       sort them
>>       print them according to output format, filtering according to options
>>
>> A possible additional refactor to the existing code would be to move filtering to the "getting" point, so that they are never added to the symbol list in the first place.
>
>
>
>   Agree with you, we should do filter when get the all the symbols a separate refactor patch after the patch landed. there are several functions need to change include function dumpSymbolsFromDLInfoMachO()  

Doing this one later is fine - I don't think it makes much difference whether filtering is done up-front or on-printing in this case.

>> You'll notice that most of these steps, and the general structure are pretty similar. Consequently, I think we can restructure your new code a bit more logically. At the end of main, instead of where we currently have the llvm::for_each to dump symbol names, and then a separate if, you do something like the following:
>>
>>   if (ExportSymbols)
>>     printExportSymbolList();
>>   else
>>     llvm::for_each(InputFilenames, dumpSymbolNamesFromFile);
>>
>> `dumpSymbolNamesFromFile` (or more precisely `dumpSymbolNamesFromObject`) will then call a function (created by moving code out of `dumpSymbolNamesFromObject`) which gets the symbols and adds them to the symbol list. Let's call this `getSymbolsFromObject` for our purposes here. Beyond that, the process remains essentially the same. `printExportSymbolList` loops over all inputs and calls `getSymbolsFromObject` on each, storing the resultant symbols in a single combined list. It then sorts and merges the list before printing. The merging and printing is unique to this code path. The gathering, filtering and sorting is shared code with the regular code path.
>>
>> Does that make sense? This isn't too far from the current implementation, but there are some subtle differences that should stop the need for lots of `if(ExportSymbols)` type checks (the only one necessary should be the one in `main`).
>
>
>
> 1. Actually , there is only two  additional  if(ExportSymbols) in the dumpSymbolNamesFromObject() , I do not think it is big problem.

The specific if checks are just a symptom of a wider structural problem with this patch, which if not improved will lead to brittleness and the potential for bugs. I don't think we should allow this to happen.

> 2. printExportSymbolList loops over all inputs and calls getSymbolsFromObject , I think it need to share the code the dumpSymbolNamesFromFile. and we need to added a additional  if(ExportSymbols) in the dumpSymbolNamesFromFile to go to different function .

I'm not sure I follow. In my proposed design, `dumpSymbolNamesFromFile` should never be touched by the export symbols code route, only `getSymbolsFromObject`.

> So I do think there is benefit.

Did you mean "do not"?

================
Comment at: llvm/test/tools/llvm-nm/XCOFF/export-symbols.test:10
+## Do not export hidden and internal symbols.
+## Remove name prefixes of global symbols beginning with "__tf1" and "__tf9".
+# RUN: llvm-nm --export-symbols %t1.o | FileCheck %s --check-prefixes=COMMON,WEAK --implicit-check-not={{.}}
----------------
DiggerLin wrote:
> jhenderson wrote:
> > Is it supposed to explicitly be "\_\_tf1" and "\_\_tf9" only, or should this include e.g. \_\_tf2, \_\_tf3 etc?
> > 
> > If it should include, I'd reword this comment. If it shouldn't, I'd add an additional symbol called __tf2... to show that the prefix isn't removed in that case.
> 1. it is only "__tf1....." and "__tf9...." 
> 2. for "__tf1...."  remove the first 6 char prefix
>  3. for "__tf9..." remove the first 15 char prefix.
> 
> the shell script of IBM original CreateExportList , $NF is symbol name.
> 
> ....
> 
> if (substr ($NF, 1, 7) != "__sinit" &&
>                  substr ($NF, 1, 7) != "__sterm" &&
>                  match($NF,/^__[0-9]+__/)==0 ) {
>                 if (substr ($NF, 1, 5) == "__tf1")
>                                          print (substr ($NF, 7)) VISIBILITY
>                 else if (substr ($NF, 1, 5) == "__tf9")
>                                          print (substr ($NF, 15)) VISIBILITY
>                 else if (!('$incl_resource' && $NF == "__rsrc"))
>                                          print $NF VISIBILITY
>              }
> ....
>  I'd add an additional symbol called __tf2... to show that the prefix isn't removed in that case.

Please address this comment then.

================
Comment at: llvm/tools/llvm-nm/llvm-nm.cpp:243

-static bool compareSymbolSize(const NMSymbol &A, const NMSymbol &B) {
-  return std::make_tuple(A.Size, A.Name, A.Address) <
-         std::make_tuple(B.Size, B.Name, B.Address);
-}
+  bool operator<(const NMSymbol &B) const {
+    if (NumericSort)
----------------
DiggerLin wrote:
> jhenderson wrote:
> > Sorry, didn't think through my comment from yesterday enough. Make this an out-of-class operator, i.e.
> > 
> > ```
> > bool operator<(const NMSymbol &A, const NMSymbol &B) { ... }
> > ```
> > 
> > Then implement the equality operator simply as follows:
> > ```
> > bool operator==(const NMSymbol &A, const NMSymbol &B) {
> >   return !(A < B) && !(B < A);
> > }
> > ```
> 1. I just wonder what is benefit to change to out-of-class operator ?  the benefit is we can use llvm::stable_sort(SymbolList, operator>); ?
> 
> since I do not think it is effecient to implement 
> 
> ```
> bool operator>(const NMSymbol &A, const NMSymbol &B) {
>   return A != B && !(A < B);
> } 
> ```
> 
> I explain below. 
> 
> 
> 2 .
> ```
> bool operator==(const NMSymbol &A, const NMSymbol &B) {
>   return !(A < B) && !(B < A);
> }
> ```
> the code is not efficient,  for example , when we compare to string equal, 
> 
> str1==str2 is more efficient than !(str1 < str2) && !(str2 <str1)
> 
> using str1 == str2 only go through each byte of the string once, 
> but 
> !(str1 < str2) && !(str2 <str1) 
> need go through each byte of the string twice.
Yes, an out-of-class operator is simpler to use as a predicate for sorting and other algorithm functions.

I think you are being unnecessarily concerned about performance. Whilst it is important to be aware of performance, I think you'll find that the optimizer will ensure that there is no additional runtime cost to implementing the operators as suggested. If you still are concerned, try implementing it both ways and running some tests to compare the total time it takes to run llvm-nm in each case, when sorting a large list of symbols.

================
Comment at: llvm/tools/llvm-nm/llvm-nm.cpp:680
+    llvm::stable_sort(SymbolList, [=](const NMSymbol &A, const NMSymbol &B) {
+      return B < A;
+    });
----------------
DiggerLin wrote:
> jhenderson wrote:
> > Consider implementing `operator>` as follows:
> > ```
> > bool operator!=(const NMSymbol &A, const NMSymbol &B) {
> >   return !(A == B);
> > }
> > 
> > bool operator>(const NMSymbol &A, const NMSymbol &B) {
> >   return A != B && !(A < B);
> > }
> > ```
> > and then using that directly as the predicate without the need for the lamdba:
> > ```
> > llvm::stable_sort(SymbolList, operator>);
> > ```
> same reason 
> bool operator>(const NMSymbol &A, const NMSymbol &B) {
>   return A != B && !(A < B);
> } 
> 
> this need to go through each bytes twice , not efficient.
As noted above: the optimizer will likely take care of this for you.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D112735/new/

https://reviews.llvm.org/D112735