[LLVMdev] Inconsistencies or intended behaviour of LLVM IR?

Mon Feb 2 12:43:45 PST 2015

On Mon, Feb 2, 2015 at 9:51 AM, Robin Eklind <carl.eklind at myport.ac.uk>
wrote:

> (forgot to cc the list)
>
> Answers, questions and assumptions are inlined in the response.
>
> If someone with knowledge of the LLVM IR type system could take a look at
> my assumptions below I'd be very happy.
>
> On 01/30/2015 02:24 AM, Sean Silva wrote:
>
>> On Thu, Jan 29, 2015 at 10:42 PM, Robin Eklind <carl.eklind at myport.ac.uk>
>> wrote:
>>
>>> Thank you for reviewing and committing the patch, Sean :) It was the
>>> first one I've ever submitted to LLVM and the whole process was really
>>> smooth! Using Phabricator with GitHub OAuth login was brilliant as it
>>> removed one more step for new contributors. I also feel very happy that
>>> the first patch ended up removing more code than it introduced :) Not
>>> likely to speed up the compilation process by a lot, but one can hope to
>>> keep the trend!
>>>
>>>
>> Great!
>>
>>
>>
>>> I read the blog post about the type system rewrite. Thank you for the
>>> link. It cleared up a lot of my uncertainties, but introduced a new one.
>>> Could you help me make sense of this part, which was presented under the
>>> "Identified structs have a 1-1 mapping with a name" section?
>>>
>>>> "... and the only types that can be named are identified structs"
>>>
>>> Does this mean that other types cannot be named? What about the type "%x"
>>> in b.ll? It seems like I'm interpreting this in the wrong way. Could you
>>> help me make this clear? Is there a difference between a named type and
>>> an identified type (or are those two ways of saying the same thing)? If
>>> types other than structures can be given names, does this name impact
>>> type equality somehow?
>>>
>>>
>> I'll need to punt to someone else for these questions. I haven't dealt
>> with
>> this part of the IR in a while.
>>
>>
>
> Anyone else knowledgeable in this area? I would like to list a set of
> assumptions that I've made after reading the blog post and experimenting
> with the reference implementation. If anyone could verify these
> assumptions, and of course point out which are incorrect, I'd be very
> grateful.
>
> * Assumption 1 - all types can be given a name, not only structures.
> * Assumption 2 - the type name works as an alias for all types except
> structures, and it is ignored when calculating type equality.
> * Assumption 3 - for structures the type name works as an identity, and
> type equality depends on it.
> * Assumption 4 - type equality is calculated by comparing the base type
> (e.g. the underlying type of a type name identifier) of one type against
> another (recursively and for each element in the case of vectors, arrays
> and other derived types). In the case of identified structures the
> comparison is made strictly based on the structure's name, and in the case
> of structure literals the comparison is made in the same way as for other
> derived types.
>

There are quite a few people on the list that can answer this. Just a
matter of waiting for one of them to pipe up.
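
For reference while waiting, the four assumptions quoted above can be
written down as a small executable model. This is only a sketch of the
assumptions as stated, not of how the LLVM libraries implement types; every
name in it (Int, Array, Pointer, LiteralStruct, IdentifiedStruct, Alias,
resolve, types_equal) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Int:
    bits: int


@dataclass(frozen=True)
class Array:
    length: int
    elem: object


@dataclass(frozen=True)
class Pointer:
    pointee: object


@dataclass(frozen=True)
class LiteralStruct:
    fields: Tuple[object, ...]


@dataclass(frozen=True)
class IdentifiedStruct:
    name: str
    fields: Tuple[object, ...]


@dataclass(frozen=True)
class Alias:
    # Assumptions 1 and 2: any non-struct type may carry a name, and that
    # name is only an alias for the underlying type.
    name: str
    target: object


def resolve(ty):
    """Strip alias names until the underlying (base) type is reached."""
    while isinstance(ty, Alias):
        ty = ty.target
    return ty


def types_equal(a, b):
    """Type equality as described by assumptions 2 to 4 (hypothetical model)."""
    a, b = resolve(a), resolve(b)
    if isinstance(a, IdentifiedStruct) or isinstance(b, IdentifiedStruct):
        # Assumption 3: identified structs are equal only by name.
        return (isinstance(a, IdentifiedStruct)
                and isinstance(b, IdentifiedStruct)
                and a.name == b.name)
    if type(a) is not type(b):
        return False
    if isinstance(a, Int):
        return a.bits == b.bits
    if isinstance(a, Array):
        return a.length == b.length and types_equal(a.elem, b.elem)
    if isinstance(a, Pointer):
        return types_equal(a.pointee, b.pointee)
    if isinstance(a, LiteralStruct):
        # Assumption 4: literal structs compare element by element.
        return (len(a.fields) == len(b.fields)
                and all(types_equal(x, y) for x, y in zip(a.fields, b.fields)))
    return False
```

Under this model, types_equal(Alias("x", Int(32)), Int(32)) is True, while
types_equal(IdentifiedStruct("x", (Int(32),)), LiteralStruct((Int(32),))) is
False, which is the behaviour reported for h.ll in item 2.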

>
>
>
>>> To keep up with the spirit of the original topic here are a few more
>>> items
>>> :)
>>>
>>> * Item 11 - hexadecimal integer constants
>>>
>>> The lexer handles hexadecimal integer constants, e.g. from
>>> lib/AsmParser/LLLexer.cpp
>>>
>>>  ///    HexIntConstant  [us]0x[0-9A-Fa-f]+
>>>>
>>>
>>> This representation of integer constants is not mentioned in the language
>>> specification as far as I can tell.
>>>
>>>
>> I assume you are talking about the 'u' and 's' prefix? That seems like a
>> historical artifact. The type system doesn't have signedness, so there is
>> no sense in which a constant can be "signed" or "unsigned". In fact, in
>> most places that even look at the signedness of the lexer's APSIntVal,
>> it's just to issue an error. A patch removing this old cruft would be
>> great.
>>
>>
>
> I'd be happy to remove this old cruft :) Just want to make sure I
> understood correctly. Are you referring to the prefix or the whole
> HexIntConstant representation? Because if we simply remove the prefix it
> would collide with the hexadecimal representation of floating point
> constants.
>

If we don't currently accept 0xDEADBEEF as an integer constant, then it's
probably safe to remove HexIntConstant altogether. That u and s prefixed
stuff is clearly out of date by several years, so clearly nobody is relying
on this if that is the only way to get a hex integer constant.
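
To make the collision mentioned above concrete, the pattern from the
LLLexer.cpp comment can be tried out in isolation. The sketch below only
transcribes that regular expression into Python; classify and its return
strings are hypothetical names, and treating every unprefixed 0x literal as
a floating point constant follows the description in this thread rather
than the lexer source, which has further cases that are left out here.

```python
import re

# Transcribed from the comment in lib/AsmParser/LLLexer.cpp quoted above.
HEX_INT = re.compile(r"[us]0x[0-9A-Fa-f]+")

# Unprefixed hex literal. Per the discussion above, this spelling is already
# taken by the hexadecimal representation of floating point constants.
HEX_UNPREFIXED = re.compile(r"0x[0-9A-Fa-f]+")


def classify(token):
    """Hypothetical helper: report which of the two patterns a token matches."""
    if HEX_INT.fullmatch(token):
        return "HexIntConstant"
    if HEX_UNPREFIXED.fullmatch(token):
        return "hex floating point"
    return "other"
```

With these patterns, classify("u0xDEADBEEF") gives "HexIntConstant" and
classify("0xDEADBEEF") gives "hex floating point", so dropping only the
prefix would leave the two spellings indistinguishable.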

>
> It seems like clang has been using HexIntConstants in the past (and maybe
> still?), based on the following comment from lib/AsmParser/LLLexer.cpp:
>
> > // Check for [us]0x[0-9A-Fa-f]+ which are Hexadecimal constant generated by
> > // the CFE to avoid forcing it to deal with 64-bit numbers.
>
> Is clang still using this representation? If not, I'll start preparing a
> patch to get rid of the HexIntConstant parsing :)
>

I don't think any code inside of clang ever directly writes .ll files; it
all happens via the llvm libraries. So all you need to make sure is that
nowhere inside the llvm libraries will write out .ll which has this
construct.

>
>
>>> * Item 12 - constant expressions
>>>
>>> The documentation of sext states that the bit size of the constant must
>>> be smaller than the target type, but the implementation also accepts
>>> constants which have the same size as the target type. Either the
>>> documentation should be updated or the implementation made more strict.
>>>
>>>> sext (CST to TYPE)
>>>>     Sign extend a constant to another type. The bit size of CST must be
>>>>     smaller than the bit size of TYPE. Both types must be integers.
>>>
>>> The same goes for the trunc, zext, sext, fptrunc and fpext operations.
>>> Some refer to larger instead of smaller, but none states that types of
>>> equal size are allowed.
>>>
>>>
>> Probably worth updating the documentation to what is actually allowed by
>> the code. Could you please send a patch to LangRef? (and for convenience,
>> can you point to the relevant source code for citation?).
>>
>>
> I'll try to look into it. So far I've not found this in the source code,
> but rather by examining the behaviour of compiling .ll files with clang.
>

Surely there is somewhere in the llvm libraries where we either reject or
accept (through inaction) extension/truncation to types of the same size.
Maybe the verifier?

>
>
>>> * Item 13 - LocalVar and LocalID for named types
>>>
>>> This is more of a question. Why are types referred to using local names
>>> "%x" instead of global names "@x"? It seems inconsistent as local names
>>> are
>>> scoped to the function; a local variable name in one function refers to a
>>> different value from a local variable name in another. Since types are
>>> scoped to the module wouldn't a global name make more sense?
>>>
>>>
>> I doubt there's a particular rationale. I wouldn't pay too much attention
>> to the sigils. They are pretty much arbitrary and just to make the lexer
>> simpler, similar to how using introducer keywords makes the parser
>> simpler.
>>
>> A more concerning inconsistency regarding sigils (if choice of sigils were
>> to be concerning) is the use of the same sigils for types and values.
>> Types are a purely compile-time thing while locals and globals actually
>> correspond to materializable run-time values (slightly muddled by things
>> like dbg.declare and llvm.assume).
>>
>>
> Would it make sense to start a discussion about this inconsistency where
> the same sigil is used for types and values? If the compatibility between
> releases is ensured using the Bitcode format, it may be possible to
> introduce a patch to the assembly representation of LLVM IR. To port old
> files to the new representation one could convert .ll files to .bc using
> the current version of llvm-as, and then convert back using a newer version
> of llvm-dis. I can understand if this is a low priority issue, but
> discussing and fixing any inconsistency in the language makes sense and
> pays off in the long run.
>

I don't think anybody really cares about the sigils. They are just there to
simplify the lexer/parser code. In this case, the complexity of
reconstructing the .ll files *including the FileCheck comments* is probably
not worth it (especially since any mistakes effectively end up silently
reducing our test coverage).

-- Sean Silva

>
>
>>>
>>> As always, I'm eager to hear more about the type system in particular.
>>> The
>>> compilation timed in at 120m36.240s while the test cases took 32m10.111s.
>>> It will be interesting to see if this goes up or down as time passes :)
>>>
>>>
>> Unfortunately probably up. On my main machine in college, a full build of
>> LLVM + Clang took 20 minutes. Last I checked (quite some time ago), that
>> machine took 40 minutes.
>>
>> Also, btw, you can do builddir/bin/llvm-lit llvm/test/path/to/test.ll to
>> run just a single test while iterating (or shell glob a list of tests; or
>> pass a directory). There's also a way to run a subset of the unittests,
>> but
>> I forget it off the top of my head.
>>
>> -- Sean Silva
>>
>>
>>
>>> Cheers /Robin Eklind
>>>
>>>
>>> On 01/28/2015 08:31 PM, Sean Silva wrote:
>>>
>>>  On Wed, Jan 28, 2015 at 6:28 PM, Robin Eklind <carl.eklind at myport.ac.uk
>>>> >
>>>> wrote:
>>>>
>>>>> Hello Sean,
>>>>>
>>>>> Thank you for your reply. I'll give your suggestion for items 6 and 7 a
>>>>> try tonight. I'll start a compilation and let it run throughout the
>>>>> night. My laptop (x61s) is 8 years old by now, so compiling LLVM takes a
>>>>> little time :)
>>>>>
>>>> This is why I did so much documentation work when in college. The docs
>>>> build much faster.
>>>>
>>>>
>>>>
>>>>> Regarding item 8. I don't know if anyone is using "": in the wild, so
>>>>> fixing the implementation might make sense. If not, the documentation
>>>>> (e.g. the QuoteLabel comment) should be updated to be in line with the
>>>>> implementation.
>>>>>
>>>> FYI the textual IR doesn't have a compatibility guarantee (we try not to
>>>> egregiously change it, but users don't expect .ll to work across
>>>> versions).
>>>>
>>>>
>>>>
>>>>  I only included item 9 since I stumbled upon it once cross-referencing
>>>>> the
>>>>> source code with the language specification. Bitrot for a project of
>>>>> this
>>>>> size is to be expected.
>>>>>
>>>>> I'm still very interested to hear about the items related to types,
>>>>> e.g.
>>>>> item 1 and 2. Is there a good reference which describes how type
>>>>> equality
>>>>> works in LLVM IR? If the source code is the reference, could someone
>>>>> with
>>>>> the high level knowledge get me up to speed?
>>>>>
>>>>>
>>>>>  Off the top of my head maybe
>>>> http://blog.llvm.org/2011/11/llvm-30-type-system-rewrite.html
>>>>
>>>>
>>>>
>>>>  Item 1 still confuses me, so I'd be very happy if someone with more
>>>>> insight could clarify if this is the intended behaviour and if so the
>>>>> motivation behind it.
>>>>>
>>>>> As it so happens, I forgot to include item 10 :)
>>>>>
>>>>> * Item 10 - lli vs. clang output
>>>>>
>>>>> Using the same source files as before, it seems like lli and clang
>>>>> treat common linkage and constant variables differently. The following
>>>>> execution demonstrates the return value after executing i.ll, j.ll,
>>>>> k.ll and l.ll with lli and clang respectively:
>>>>>
>>>>>   $ clang i.ll && ./a.out ; echo $?
>>>>>
>>>>>> 37
>>>>>>
>>>>>> $ lli i.ll ; echo $?
>>>>>> 37
>>>>>>
>>>>>>
>>>>>> $ clang j.ll && ./a.out ; echo $?
>>>>>> 0
>>>>>>
>>>>>> $ lli j.ll ; echo $?
>>>>>> 42
>>>>>>
>>>>>>
>>>>>> $ clang k.ll && ./a.out ; echo $?
>>>>>> 37
>>>>>>
>>>>>> $ lli k.ll ; echo $?
>>>>>> 37
>>>>>>
>>>>>>
>>>>>> $ clang l.ll && ./a.out ; echo $?
>>>>>> Segmentation fault
>>>>>> 139
>>>>>>
>>>>>> $ lli l.ll ; echo $?
>>>>>> 37
>>>>>>
>>>>>>
>>>>>
>>>>>  Some of these linkage combinations and operations have dubious
>>>> semantics.
>>>> Talking briefly with Rafael Espindola over a build, sounds like we
>>>> should
>>>> mostly tighten up the verifier to remove some of these weird cases. For
>>>> example, storing to a constant is sort of .... I'm sort of surprised it
>>>> works at all.
>>>>
>>>> -- Sean Silva
>>>>
>>>>
>>>>
>>>>> Looking forward to hearing more about type equality, or getting a
>>>>> pointer as to where I can read up about it.
>>>>>
>>>>> Cheers /Robin Eklind
>>>>>
>>>>>
>>>>>
>>>>> On 01/28/2015 03:45 PM, Sean Silva wrote:
>>>>>
>>>>>   A couple quick comments inline (didn't touch on all points):
>>>>>
>>>>>>
>>>>>> On Wed, Jan 28, 2015 at 1:49 AM, Robin Eklind <
>>>>>> carl.eklind at myport.ac.uk
>>>>>>
>>>>>>>
>>>>>>>  wrote:
>>>>>>
>>>>>>    Hello everyone!
>>>>>>
>>>>>>
>>>>>>> I've recently had a chance to familiarize myself with the
>>>>>>> nitty-gritty
>>>>>>> details of LLVM IR. It has been a great learning experience,
>>>>>>> sometimes
>>>>>>> frustrating or confusing but mostly rewarding.
>>>>>>>
>>>>>>> There are a few cases I've come across which seem odd to me. I've
>>>>>>> tried to cross-reference with the language specification and the
>>>>>>> source code to the best of my abilities, but would like to reach out
>>>>>>> to an experienced crowd with a few questions.
>>>>>>>
>>>>>>> Could you help me out by taking a look at these examples? To my
>>>>>>> novice
>>>>>>> eyes they seem to highlight inconsistencies in LLVM IR (or the
>>>>>>> reference
>>>>>>> implementation), but it is quite likely that I've overlooked
>>>>>>> something.
>>>>>>> Please help me out.
>>>>>>>
>>>>>>> Note: the example source files have been attached and a copy is made
>>>>>>> available at https://github.com/mewplay/ll
>>>>>>>
>>>>>>> * Item 1 - named pointer types
>>>>>>>
>>>>>>> It is possible to create a named array pointer type (and many
>>>>>>> others),
>>>>>>> but
>>>>>>> not a named structure pointer type. E.g.
>>>>>>>
>>>>>>> %x = type [1 x i32]* ; valid.
>>>>>>> %x = type {i32}*     ; invalid.
>>>>>>>
>>>>>>> Is this the intended behaviour? Attaching a.ll, b.ll, c.ll and d.ll
>>>>>>> for reference. All files except d.ll compile without error using
>>>>>>> clang version 3.5.1 (tags/RELEASE_351/final).
>>>>>>>
>>>>>>>    $ clang d.ll
>>>>>>>
>>>>>>>  d.ll:3:16: error: expected top-level entity
>>>>>>>> %x = type {i32}*
>>>>>>>>                   ^
>>>>>>>> 1 error generated.
>>>>>>>>
>>>>>>>>
>>>>>>>>  Does it have anything to do with type equality? (just a hunch)
>>>>>>>
>>>>>>> * Item 2 - equality of named types
>>>>>>>
>>>>>>> A named integer type is equivalent to its literal type counterpart,
>>>>>>> but
>>>>>>> the same is not true for named and literal structures. I am certain
>>>>>>> that
>>>>>>> I've read about this before, but can't seem to locate the right
>>>>>>> section
>>>>>>> of
>>>>>>> the language specification; could anyone point me in the right
>>>>>>> direction?
>>>>>>> Also, what is the motivation behind this decision? I've skimmed over
>>>>>>> the
>>>>>>> code which handles named structure types (in lib/IR/core.cpp), but
>>>>>>> would
>>>>>>> love to hear the high level idea.
>>>>>>>
>>>>>>> Attaching e.ll, f.ll, g.ll and h.ll for reference. All compile just
>>>>>>> fine except h.ll, which produces the following error message (using
>>>>>>> the same version of clang as above):
>>>>>>>
>>>>>>>    $ clang h.ll
>>>>>>>
>>>>>>>> h.ll:10:23: error: argument is not of expected type '%x = type { i32 }'
>>>>>>>>            call void (%x)* @foo({i32} {i32 0})
>>>>>>>>                                 ^
>>>>>>>> 1 error generated.
>>>>>>>>
>>>>>>>>
>>>>>>>>  * Item 3 - zero initialized common linkage variables
>>>>>>>
>>>>>>> According to the language specification common linkage variables are
>>>>>>> required to have a zero initializer [1]. If so, why are they also
>>>>>>> required
>>>>>>> to provide an initial value?
>>>>>>>
>>>>>>> Attaching i.ll and j.ll for reference. Both compile just fine and
>>>>>>> once executed i.ll returns 37 and j.ll returns 0. If the common
>>>>>>> linkage variable @x was not initialized to 0, j.ll would have
>>>>>>> returned 42.
>>>>>>>
>>>>>>> * Item 4 - constant common linkage variables
>>>>>>>
>>>>>>> The language specification states that common linkage variables may
>>>>>>> not
>>>>>>> be
>>>>>>> marked as constant [1]. The parser doesn't seem to enforce this
>>>>>>> restriction. Would doing so cause any problems?
>>>>>>>
>>>>>>> Attaching k.ll and l.ll for reference. Both compile just fine, but
>>>>>>> once executed k.ll returns 37 (i.e. the constant variable was
>>>>>>> overwritten) while l.ll segfaults as expected when it tries to
>>>>>>> overwrite a read-only memory location.
>>>>>>>
>>>>>>> * Item 5 - appending linkage restrictions
>>>>>>>
>>>>>>> An extract from the language specification [1]:
>>>>>>>
>>>>>>>> "appending" linkage may only be applied to global variables of
>>>>>>>> pointer to array type.
>>>>>>>>
>>>>>>>
>>>>>>> Similarly to item 4 this restriction isn't enforced by the parser.
>>>>>>> Would
>>>>>>> it make sense doing so, or is there any problem with such an
>>>>>>> approach?
>>>>>>>
>>>>>>> * Item 6 - hash token
>>>>>>>
>>>>>>> The hash token (#) is defined in lib/AsmParser/LLToken.h (release
>>>>>>> version
>>>>>>> 3.5.0 of the LLVM source code) but doesn't seem to be used anywhere
>>>>>>> else
>>>>>>> in
>>>>>>> the source tree. Is this token a historical artefact or does it
>>>>>>> serve a
>>>>>>> purpose?
>>>>>>>
>>>>>>>
>>>>>>>   Try deleting it. If the tests pass send a patch. Same for item 7.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>   * Item 7 - backslash token
>>>>>>
>>>>>>>
>>>>>>> Similarly to item 6 the backslash token doesn't seem to serve a
>>>>>>> purpose (with regards to release version 3.5.0 of the LLVM source
>>>>>>> code). Is it used somewhere?
>>>>>>>
>>>>>>> * Item 8 - quoted labels
>>>>>>>
>>>>>>> A comment in lib/AsmParser/LLLexer.cpp (once again, release version
>>>>>>> 3.5.0
>>>>>>> of the LLVM source code) describes quoted labels using the following
>>>>>>> regexp
>>>>>>> (e.g. at least one character between the double quotes):
>>>>>>>
>>>>>>>    ///   QuoteLabel        "[^"]+":
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>  In contrast the reference implementation accepts quoted labels with
>>>>>>> zero
>>>>>>> or more characters between the double quotes. Which is to be trusted?
>>>>>>> The
>>>>>>> comment makes more sense as the variable name would effectively be
>>>>>>> blank
>>>>>>> otherwise.
>>>>>>>
>>>>>>>
>>>>>> Looks like an empty name just results in the thing becoming unnamed.
>>>>>> That's sort of confusing, but probably not harmful. Maybe we use an
>>>>>> empty name as a sentinel for "unnamed", so it sort of just was an
>>>>>> accident of the implementation.
>>>>>>
>>>>>>
>>>>>>
>>>>>>   * Item 9 - undocumented calling conventions
>>>>>>
>>>>>>>
>>>>>>> The following calling conventions are valid tokens but not described
>>>>>>> in the language reference as of revision 223189:
>>>>>>>
>>>>>>> intel_ocl_bicc, x86_stdcallcc, x86_fastcallcc, x86_thiscallcc,
>>>>>>> x86_vectorcallcc, arm_apcscc, arm_aapcscc, arm_aapcs_vfpcc,
>>>>>>> msp430_intrcc, ptx_kernel, ptx_device, spir_kernel, spir_func,
>>>>>>> x86_64_sysvcc, x86_64_win64cc, ghccc
>>>>>>>
>>>>>>>
>>>>>>>    This is just bitrot.
>>>>>>>
>>>>>>>
>>>>>> -- Sean Silva
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>> Lastly I'd just like to thank the LLVM developers for all the time
>>>>>>> and hard work they've put into this project. I'd especially like to
>>>>>>> thank you for providing a language specification alongside the
>>>>>>> reference implementation! Keeping it up to date is a huge task, but
>>>>>>> also hugely important. Thank you!
>>>>>>>
>>>>>>> Kind regards
>>>>>>> /Robin Eklind
>>>>>>>
>>>>>>> [1]: http://llvm.org/docs/LangRef.html#linkage-types
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> LLVM Developers mailing list
>>>>>>> LLVMdev at cs.uiuc.edu         http://llvm.cs.uiuc.edu
>>>>>>> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>