[PATCH] D91460: [AsmParser] make .ascii/.asciz/.string support multiple strings

Mon Nov 16 14:35:20 PST 2020

nickdesaulniers added a comment.

In D91460#2398062 <https://reviews.llvm.org/D91460#2398062>, @jcai19 wrote:

>> https://github.com/ClangBuiltLinux/linux/blob/f01c30de86f1047e9bae1b1b1417b0ce8dcd15b1/arch/arm/probes/kprobes/test-core.h#L114-L116 makes it sound like `.ascii` and `.asciz` work differently in this regard?
>
> So I played with the two directives a little bit, and I think they both treat a call with multiple string arguments the same as multiple calls with one argument. For example, `.ascii` produces the same disassembly with `"foo" "bar"`, `"foo", "bar"` or `"foobar"` since the directive does not append anything after each string, while `.asciz` appends `\0` after each string and turns `"foo" "bar"` or `"foo", "bar"` into `"foo\0bar"`.

This looks like confirmation of what I said.
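To make the difference concrete, here is a minimal Python sketch of the byte emission described in the quoted comment. It is only a model of the observed GNU `as` behavior, not the parser code in this patch; the helper name `emit` is hypothetical, and it assumes that both `"foo" "bar"` and `"foo", "bar"` parse to the same operand list.

```python
from typing import Sequence


def emit(directive: str, operands: Sequence[str]) -> bytes:
    """Model the bytes emitted for a string directive (hypothetical helper)."""
    # .asciz and .string terminate every operand with NUL; .ascii appends nothing.
    terminator = b"\0" if directive in (".asciz", ".string") else b""
    return b"".join(op.encode("ascii") + terminator for op in operands)


# Assumption: `"foo" "bar"` and `"foo", "bar"` both yield the operands ["foo", "bar"].
print(emit(".ascii", ["foo", "bar"]))  # same bytes as .ascii "foobar"
print(emit(".asciz", ["foo", "bar"]))  # a NUL follows each operand
print(emit(".asciz", ["foobar"]))      # a single trailing NUL
```

Under this model, a directive with several operands is equivalent to repeating the directive once per operand, which is why the separator only matters for `.asciz` and `.string`.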

> With .ascii:
>
>   $ cat test_space.s 
>   .ascii "foo" "bar"
>   $ cat test_comma.s 
>   .ascii "foo", "bar"
>   $ cat test_concatenation.s 
>   .ascii "foobar"
>   
>   $ gcc test_space.s -c -o test_space.o
>   $ objdump -dr test_space.o

You can use `llvm-readelf --string-dump=.text test_space.o` to interpret the `.text` section as C-style strings, though that might not be helpful for the `.ascii` examples, which contain no `NUL` terminators.

In D91460#2395522 <https://reviews.llvm.org/D91460#2395522>, @jrtc27 wrote:

> In D91460#2395521 <https://reviews.llvm.org/D91460#2395521>, @jcai19 wrote:
>
>>> In which case I assume `.asciz "foo" "bar"` is equivalent to `.asciz "foobar"` rather than `.asciz "foo\0bar"` (same for `.string`), just like C string concatenation, and hence why both syntaxes exist?
>>
>> I think GNU assembler would produce the same disassembly with `.asciz "foo" "bar"` and `.asciz "foo", "bar"`.
>>
>>   $ cat test.s 
>>   .asciz "foo" "bar"
>>   .asciz "foo", "bar"
>>   
>>   $ arm-linux-gnueabihf-gcc test.s -c -o test.o
>>   $ arm-linux-gnueabihf-objdump -dr test.o
>>   gcc.o:     file format elf32-littlearm
>>   
>>   
>>   Disassembly of section .text:
>>   
>>   00000000 <.text>:
>>      0:	006f6f66 	.word	0x006f6f66
>>      4:	00726162 	.word	0x00726162
>>      8:	006f6f66 	.word	0x006f6f66
>>      c:	00726162 	.word	0x00726162
>
> In which case that's confusing and likely to lead to bugs if people make use of the preprocessor in the hopes of concatenating strings but end up with NUL bytes being inserted contrary to what there would be in C. Can we not just fix the assembly to use commas rather than this weird syntax that seems to be a special case for `.asciz`? You can't write `.word 2 2`, only `.word 2, 2`, so why do string directives really need special treatment?

I disagree.  That's why `.ascii` is distinct from `.asciz`; if developers do not want `NUL`-terminated C-style strings, they should use `.ascii` and not `.asciz`.  The assembler need not match the behavior of the C preprocessor; this patch is about matching the behavior of GNU `as` such that `clang` can be used as a substitute.  Not matching the behavior of GNU `as` precisely here would be a mistake that would hinder the adoption of clang for existing assembler sources.
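For anyone who prefers not to read the `.word` values by eye, the quoted objdump output can be decoded with a short Python sketch. It assumes the four words are copied correctly from that output and that the target is little-endian, as the `elf32-littlearm` file format indicates.

```python
import struct

# .word values copied from the quoted objdump output (elf32-littlearm).
words = [0x006F6F66, 0x00726162, 0x006F6F66, 0x00726162]

# Re-serialize each 32-bit word as little-endian bytes to recover the section contents.
section = b"".join(struct.pack("<I", w) for w in words)

# Split on NUL the way a C-string dump would; drop the empty piece after the final NUL.
strings = [s.decode("ascii") for s in section.split(b"\0") if s]

print(section)
print(strings)
```

Each word decodes to three characters plus a `NUL`, so both the space-separated and the comma-separated `.asciz` lines produce two separately terminated strings.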

@jcai19 this patch still needs tests for this new change of behavior to the parser.  You have good test cases in your comments; those should be added into this patch itself.

Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D91460