[PATCH] D100651: [AIX] Support of Big archive (read)

Fri Jul 9 02:07:15 PDT 2021

jhenderson added a comment.

Okay, I've spent a bit of time doing some more reviewing here.

Some high-level comments:

1. You should get some people who are more familiar with XCOFF/AIX to help review correctness. There has been a lot of development for this in other tools (e.g. llvm-readobj).
2. Update the review summary to be the body of the proposed commit message.
3. The "Big Archive" format seems to be an AIX-specific thing. I wonder if we should refer to it as the "AIX Big Archive" format everywhere, to avoid confusion. See also inline comments.
4. I think you should be using the archive member headers directly more often.

I haven't had time to fully go through this review, but this should get you started.

================
Comment at: llvm/include/llvm/Object/Archive.h:337

-  enum Kind { K_GNU, K_GNU64, K_BSD, K_DARWIN, K_DARWIN64, K_COFF };
+  enum Kind { K_GNU, K_GNU64, K_BSD, K_DARWIN, K_DARWIN64, K_COFF, K_XCOFF };

----------------
Archives can generally hold files of any format, including non-object files like text files. It seems to me the corect name should be `K_AIXBIG`. What do you think?

================
Comment at: llvm/include/llvm/Object/Archive.h:371-373
+  // On Big Archive, total length is needed
+  // because end of file is member table and global symbol table.
+  uint32_t Length = 0;
----------------
Do you really need this length? Why is it `uint32_t`?

================
Comment at: llvm/include/llvm/Object/Archive.h:391
+  // so we need to memorize position.
+  static uint64_t CurrentLocation;
+
----------------
EGuesnet wrote:
> rupprecht wrote:
> > Does this need to be static? e.g. this could break if any tool needed to read two separate big archives
> As far as I know, static variable is needed. Location must be modified by Child and must be stored in Parent. As Child cannot modify Parent, it is the easiest way to deal with.
> I agree it could break other tools that read two separate big archive, but I have tested this case: Rust reads lot of separate archives, and probably read them twice in some cases. And it works. The first version was broken due to static variables, but it is now correctly implemented.
This is completely unthread safe - if you have a tool parsing two archives simultaneously, this will fail horribly. Do you really need it though? See my comments further down.

>From reading the spec, the archive uses offsets everywhere and it should be simple to use the values stored in the member headers.

================
Comment at: llvm/lib/Object/Archive.cpp:484

+// Child constructors
 Archive::Child::Child(const Archive *Parent, StringRef Data,
----------------
Don't add this comment. We can see that these are child constructors by the function signature.

================
Comment at: llvm/lib/Object/Archive.cpp:490
+  if (Parent->kind() != K_XCOFF) {
+    Header = std::unique_ptr<ArchiveMemberHeader>(new ArchiveMemberHeader(Parent, Data.data(), Data.size(), nullptr));
+  } else {
----------------
Use `std::make_unique` to create a unique pointer without needing to explicitly mention `new`. Same everywhere else you've done this.

================
Comment at: llvm/lib/Object/Archive.cpp:499
+    : Parent(Parent) {
+
+  if (!Start) {
----------------
Don't add a blank line at the function start.

================
Comment at: llvm/lib/Object/Archive.cpp:648

 Expected<Archive::Child> Archive::Child::getNext() const {
   size_t SpaceToSkip = Data.size();
----------------
It seems to me like the logic you've added to this function is a little odd. I think you may be trying too hard to match the existing logic, when it doesn't really make sense. In traditional UNIX ar format, each child is immediately followed by the next one (barring a possible even-alignment padding byte), stopping when you get to the end. My reading of the Big Archive spec is that the next child's offset is defined by the ar_nxtmem member of its header. You should just be using that and fl_lstmoff to identify if the current child is the last one or not. You can then check to see if the child goes past the buffer end, as per the existing check in this file.

================
Comment at: llvm/lib/Object/Archive.cpp:763-766
   // ArchiveMemberHeader::getName() is made.  This could be a valid empty
   // archive which is the same in all formats.  So claiming it to be gnu to is
   // fine if not totally correct before we look for a string table or table of
   // contents.
----------------
This comment implies an empty archive is the same across all formats, but it looks like that's not the case for the Big Archive format. As I understand from my reading, we can't iterate over children to see if the archive is truly empty using the GNU format, for an AIX archive, so the comment needs updating.

================
Comment at: llvm/lib/Object/Archive.cpp:801
+  // AIX Big Archive is totally different that all other.
+  if (Buffer.startswith(BigMagic)) {
+    Format = K_XCOFF;
----------------
I'd move this logic below the following comment, since although the process is different, it is still determining the archive format.

You can simply add a note saying something like:
```
// AIX Big archive format
//   Identified purely by magic bytes and uses a unique format.
```

================
Comment at: llvm/lib/Object/Archive.cpp:803-804
+    Format = K_XCOFF;
+    // Length of archive (all object file + header)
+    // is offset to member table, located at 8->27.
+    Buffer.substr(8, 20).rtrim(" ").getAsInteger(10, Length);
----------------
Don't reflow your comments before the 80-character column width. Let clang-format do that for you.

I'm not really sure what this comment is trying to say, if I'm honest.

================
Comment at: llvm/lib/Object/Archive.cpp:1016

 Expected<Archive::Child> Archive::Symbol::getMember() const {
   const char *Buf = Parent->getSymbolTable().begin();
----------------
Do you actually need to change this method in the first instance? You don't have any testing for symbol printing (and I don't think you should at this point...).

================
Comment at: llvm/lib/Object/Archive.cpp:1021-1024
+  else if (Parent->kind() == K_XCOFF)
+    Offsets += 20;
+  // Each offset is 20 bytes long
   else
----------------

================
Comment at: llvm/lib/Object/ArchiveWriter.cpp:141
+  case object::Archive::K_XCOFF:
+      llvm_unreachable("Not handled yet");
   case object::Archive::K_COFF:
----------------
Make sure to clang-format all your modified code.

I'd actually use `report_fatal_error()` rather than `llvm_unreachable()`, if the code could be reached by a tool receiving a file of this wrong kind, with the corresponding options. If it is genuinely unreachable of course, there is no need to change.

Same below.

================
Comment at: llvm/test/Object/archive-big-read.test:1
+# Test reading an archive created by AIX ar (Big Archive)
+RUN: env TZ=GMT llvm-ar tv %p/Inputs/Big.a | FileCheck %s -strict-whitespace
----------------
Two related nits:
1) Although strictly not necessary, I find it slightly disorientating when seeing a lit test without comment markers for RUN and CHECK lines. I'd add them.
2) For true comments, I encourage in newer tests to use `##` to disitnguish from CHECK/RUN lines.

Also, comments should end with a full stop ('.') - applies below too.

================
Comment at: llvm/test/Object/archive-big-read.test:2
+# Test reading an archive created by AIX ar (Big Archive)
+RUN: env TZ=GMT llvm-ar tv %p/Inputs/Big.a | FileCheck %s -strict-whitespace
+CHECK:      rw-r--r-- 0/0      8 Apr 21 14:12 2021 evenlen
----------------
I'd rename the binary to `bigFormat.a` to avoid confusing it with a large regular archive.

Also, a personal preference is to use double dash for long options (i.e. `--strict-whitespace`).

Also, I'd add `--implicit-check-not={{.}}` to the FileCheck line. Without it the following output would be successfully matched against, because FileCheck matches sub-strings:
```rw-r--r-- 0/0      8 Apr 21 14:12 2021 evenlendsabdjasbababfjbasjk```
It also ensures there's no output before the first line or after the last checked line.

================
Comment at: llvm/test/Object/archive-big-read.test:7
+
+# Test extraction of a file
+RUN: llvm-ar p %p/Inputs/Big.a evenlen | FileCheck %s -check-prefix=EVENLEN
----------------
I think we should test each operation in isolation. In other words, test `t` in a separate file to `p`. One is about reading the archive member list, the other about the archive member contents, so although the latter relies on the former, the inverse isn't true.

================
Comment at: llvm/test/Object/archive-big-read.test:8
+# Test extraction of a file
+RUN: llvm-ar p %p/Inputs/Big.a evenlen | FileCheck %s -check-prefix=EVENLEN
+EVENLEN: evenlen
----------------
I'm assuming the file contents are "evenlen". I'd actually be inclined to change the contents to be different to the member name. That way, it prevents a bug where the operation is returning the wrong thing.

I think you should also add to the FileCheck line `--implicit-check-not={{.}}`. This will ensure that there is no other output printed, as by default FileCheck looks for sub-strings.

Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D100651/new/

https://reviews.llvm.org/D100651