[cfe-dev] [RFC] Delayed target-specific diagnostic when compiling for the devices.

Finkel, Hal J. via cfe-dev cfe-dev at lists.llvm.org
Thu Jan 17 11:14:59 PST 2019

On 1/17/19 12:46 PM, Alexey Bataev wrote:
> -------------
> Best regards,
> Alexey Bataev
> 17.01.2019 13:41, Finkel, Hal J. wrote:
>> On 1/17/19 12:18 PM, Alexey Bataev wrote:
>>> -------------
>>> Best regards,
>>> Alexey Bataev
>>> 17.01.2019 13:11, Finkel, Hal J. wrote:
>>>> On 1/17/19 11:44 AM, Alexey Bataev wrote:
>>>>> -------------
>>>>> Best regards,
>>>>> Alexey Bataev
>>>>> 17.01.2019 12:40, Finkel, Hal J. wrote:
>>>>>> On 1/17/19 11:11 AM, Alexey Bataev wrote:
>>>>>>> The compiler does not know anything about the layout on the host when
>>>>>>> it compiles for the device.
>>>>>> No, the compiler does know about the host layout (e.g., can't we
>>>>>> construct this by calling getContext().getAuxTargetInfo(), or similar?).
>>>>> The compiler knows, but the backend is unable to correctly represent
>>>>> unsupported types.
>>>> To any extent to which this is true, this is something that we can fix.
>>>>>>> We cannot do anything with the types that are not supported by the
>>>>>>> target device and we cannot use the layout from the host. And it is
>>>>>>> user responsibility to write and use the code that is compatible with
>>>>>>> the target devices.
>>>>>>> He/she does not need to use macros/void* types, there are templates.
>>>>>> No. This doesn't solve the problem (because you still need to share the
>>>>>> instantiations between the devices). Also, even if it did, it does not
>>>>>> address the legacy-code problem that the feature is intended to address.
>>>>>> The user already has classes and data on the host and wishes to access
>>>>>> *parts* of that data on the device. We should make as much of that work
>>>>>> as possible.
>>>>> Design your code to be compatible with both the host and the device.
>>>> Your vantage point here is certainly not the same as mine. From my
>>>> perspective, we already have many hundreds of millions of lines of code
>>>> and we're incrementally porting that code to use OpenMP and accelerators.
>>>> This is not a new top-down design process. We already have data
>>>> structures and we want to make accessing those data structures from the
>>>> accelerator as easy as possible. We should require as little code change
>>>> as possible - we'll have more than enough code change necessary for hot
>>>> code regions, and we'd like to minimize necessary changes elsewhere.
>>>> Anything that we can find a way to support in this regard, we should.
>>> But we should not break the compiler because of this. Plus, don't you
>>> think that we may break code reliability?
>> I'm proposing no such thing. Or, you say "break" and I say "properly
>> implement" :-)
> I don't think that what you're saying is "properly implement". For me
> it sounds like "let's break the compiler in favor of user code".
I'm not understanding why you say this "breaks" the compiler. In what
way would the compiler be broken?

> Or "let's add a hack in favor of user code".
I disagree. I believe the resulting model is more consistent both for us
and for our users. The model is: types are always fine, but specific
operations might not be allowed.
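
To make that model concrete, here is a minimal sketch built on the struct X
example quoted further down. The names Wide, read_a, and scale_b are
hypothetical and introduced only for illustration, and the fallback to long
double is an assumption so the sketch builds on compilers without __float128.
It is not a description of what Clang does today.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: use __float128 where the compiler provides it (GCC and Clang
// typically define __SIZEOF_FLOAT128__ in that case); otherwise fall back to
// long double purely so the sketch still builds.
#if defined(__SIZEOF_FLOAT128__)
using Wide = __float128;
#else
using Wide = long double;
#endif

struct X {
  int a;
  Wide b;  // imagine the device cannot do arithmetic on this type
};

// Touches no Wide value: only a pointer is copied.
X *foo(X *x) { return x; }

// Needs X's layout (the offset of `a`) but performs no Wide operation.
int read_a(const X *x) { return x->a; }

// Performs a Wide multiplication. Under the model discussed here, this is the
// kind of function that would be rejected, and only if it had to be emitted
// for a device lacking the operation.
Wide scale_b(const X *x) { return x->b * 2; }

// Layout facts that host and device would need to agree on.
static_assert(offsetof(X, a) == 0, "a is the first member");
static_assert(offsetof(X, b) % alignof(Wide) == 0, "b is suitably aligned");
```

In this sketch foo and read_a only need the layout of X, so they would be
fine on any target that shares the host's layout rules; scale_b is the only
one that needs an operation on the wide type.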

Thanks again,


>> We would still need to produce an error in cases where we'd need to
>> produce an operation that we couldn't lower. We could also produce
>> runtime calls and let linking fail, but an error is better.
>>> Introduce possible security issues in the code emitted by the compiler
>>> that does not take device restrictions into account?
>> This is not a concern - we'd still fail to produce code that we couldn't
>> lower, but only if we need operations that are problematic, not based on
>> transitive type referencing.
>> Thanks again,
>> Hal
>>>>>>> You cannot use classes, which use types incompatible with the device.
>>>>>>> There is a problem with the data layout on the device and we just
>>>>>>> don't know how to represent such classes on the device.
>>>>>> There's no reason for this to be true. To be clear, the model of a
>>>>>> shared address space only makes sense, from a user perspective, if the
>>>>>> data layout is the same between the host and the target. Not mostly
>>>>>> similar, but the same. Otherwise, users will constantly be tracking down
>>>>>> subtle data-layout incompatibilities.
>>>>> That's why he/she must use only types, which are supported and have
>>>>> the same data layout on the host and on the device. Otherwise, there
>>>>> is data layout incompatibility.
>>>> Again, the model only makes sense if the host and device share
>>>> data-layout rules. This is true even for types on which we cannot
>>>> (efficiently) operate on the accelerator.
>>>> Thanks again,
>>>> Hal
>>>>>> Thanks again,
>>>>>> Hal
>>>>>>> -------------
>>>>>>> Best regards,
>>>>>>> Alexey Bataev
>>>>>>> 17.01.2019 11:47, Finkel, Hal J. wrote:
>>>>>>>> On 1/17/19 9:52 AM, Alexey Bataev wrote:
>>>>>>>>> Because the type is not compatible with the target device.
>>>>>>>> But it's not that simple. The situation is that the programming
>>>>>>>> environment supports the type, but *operations* on that type are not
>>>>>>>> supported in certain contexts (e.g., when compiled for a certain
>>>>>>>> device). As you point out, we already need to move in this explicit
>>>>>>>> direction by, for example, allowing typedefs for types that are not
>>>>>>>> supported in all contexts, function declarations, and so on. In the end,
>>>>>>>> we should allow our users to design their classes and abstractions using
>>>>>>>> good software-engineering practice without worrying about access-context
>>>>>>>> partitioning.
>>>>>>>> Also, the other problem here is that the function I used as an example
>>>>>>>> is a very common C++ idiom. There are a lot of classes with functions
>>>>>>>> that return a reference to themselves. Classes can have lots of data
>>>>>>>> members, and those members might not be accessed on the device (even if
>>>>>>>> the class itself might be accessed on the device). We're moving to a
>>>>>>>> world in which unified memory is common - the promise of this technology
>>>>>>>> is that configuration data and complex data structures, which might be
>>>>>>>> occasionally accessed (but for which explicitly managing data movement
>>>>>>>> is not performance relevant) are handled transparently. If use of these
>>>>>>>> data structures is transitively poisoned by use of any type not
>>>>>>>> supported on the device (including by pointers to types that use those
>>>>>>>> types), then we'll force unhelpful and technically-unnecessary
>>>>>>>> refactoring, thus reducing the value of the feature.
>>>>>>>> In the current implementation we pre-process the source twice, and so we
>>>>>>>> can:
>>>>>>>>  1. Use ifdefs to change the data members when compiling for different
>>>>>>>> targets. This is hard to get right because, in order to keep the data
>>>>>>>> layout otherwise the same, the user needs to understand the layout rules
>>>>>>>> in order to put something in the structure that is supported on the
>>>>>>>> target and keeps the layout the same (this is very error prone). Also,
>>>>>>>> if we move to a single-preprocessing-stage model, this no longer works.
>>>>>>>>  2. Replace all pointers to relevant types with void*, or similar, and
>>>>>>>> use a lot of casts. This is also bad.
>>>>>>>> We shouldn't be forcing users to play these games. The compiler knows
>>>>>>>> the layout on the host and it can use it on the target. The fact that
>>>>>>>> some operations on some types might not be supported on the target is
>>>>>>>> not relevant to handling pointers/references to containing types.
>>>>>>>> Thanks again,
>>>>>>>> Hal
>>>>>>>>> -------------
>>>>>>>>> Best regards,
>>>>>>>>> Alexey Bataev
>>>>>>>>> 17.01.2019 10:50, Finkel, Hal J. wrote:
>>>>>>>>>> On 1/17/19 9:27 AM, Alexey Bataev wrote:
>>>>>>>>>>> It should be compilable for the device only iff function foo is not used
>>>>>>>>>>> on the device.
>>>>>>>>>> Says whom? I disagree. This function should work on the device. Why
>>>>>>>>>> should it not?
>>>>>>>>>>  -Hal
>>>>>>>>>>> -------------
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Alexey Bataev
>>>>>>>>>>> 17.01.2019 10:24, Finkel, Hal J. wrote:
>>>>>>>>>>>> On 1/17/19 4:05 AM, Alexey Bataev wrote:
>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>> Alexey Bataev
>>>>>>>>>>>>>> On 17 Jan 2019, at 0:46, Finkel, Hal J. <hfinkel at anl.gov> wrote:
>>>>>>>>>>>>>>> On 1/16/19 8:45 AM, Alexey Bataev wrote:
>>>>>>>>>>>>>>> Yes, I thought about this. But we need to delay the diagnostic until
>>>>>>>>>>>>>>> the Codegen phase. What I need is the way to associate the diagnostic
>>>>>>>>>>>>>>> with the function so that this diagnostic is available in CodeGen.
>>>>>>>>>>>>>>> Also, we need to postpone the diagnostics not only for functions,
>>>>>>>>>>>>>>> but, for example, for some types. For example, __float128 type is not
>>>>>>>>>>>>>>> supported by CUDA. We can get error messages when we run into
>>>>>>>>>>>>>>> something like `typedef __float128 SomeOtherType` (say, in some system
>>>>>>>>>>>>>>> header files) and get the error diagnostic when we compile for the
>>>>>>>>>>>>>>> device. Though, actually, this type is not used in the device code,
>>>>>>>>>>>>>>> the diagnostic is still emitted and we need to delay too and emit it
>>>>>>>>>>>>>>> only iff the type is used in the device code.
>>>>>>>>>>>>>> This should be fixed for CUDA too, right?
>>>>>>>>>>>>>> Also, we still get to have pointers to aggregates containing those types
>>>>>>>>>>>>>> on the device, right?
>>>>>>>>>>>>> No, why? This is not allowed and should be diagnosed too. If somebody tries to use a disallowed type for device variables/functions, it should be diagnosed.
>>>>>>>>>>>> Because this should be allowed. If I have:
>>>>>>>>>>>> struct X {
>>>>>>>>>>>>   int a;
>>>>>>>>>>>>   __float128 b;
>>>>>>>>>>>> };
>>>>>>>>>>>> and we have some function which does this:
>>>>>>>>>>>> X *foo(X *x) {
>>>>>>>>>>>>   return x;
>>>>>>>>>>>> }
>>>>>>>>>>>> We'll certainly want this function to compile for all targets, even if
>>>>>>>>>>>> there's no __float128 support on some accelerator. The whole model only
>>>>>>>>>>>> really makes sense if the accelerator shares the aggregate-layout rules
>>>>>>>>>>>> of the host, and this is a needless hassle for users if this causes an
>>>>>>>>>>>> error (especially in a unified-memory environment where configuration
>>>>>>>>>>>> data structures, etc. are shared between devices).
>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>> Hal
>>>>>>>>>>>>>> Thanks again,
>>>>>>>>>>>>>> Hal
>>>>>>>>>>>>>>> -------------
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Alexey Bataev
>>>>>>>>>>>>>>> 15.01.2019 17:33, John McCall wrote:
>>>>>>>>>>>>>>>>> On 15 Jan 2019, at 17:20, Alexey Bataev wrote:
>>>>>>>>>>>>>>>>> This is not only for asm, we need to delay all target-specific
>>>>>>>>>>>>>>>>> diagnostics.
>>>>>>>>>>>>>>>>> I'm not saying that we need to move the host diagnostic, only the
>>>>>>>>>>>>>>>>> diagnostic for the device compilation.
>>>>>>>>>>>>>>>>> As for Cuda, it is a little bit different. In Cuda the programmer
>>>>>>>>>>>>>>>>> must explicitly mark the device functions,  while in OpenMP it must
>>>>>>>>>>>>>>>>> be done implicitly. Thus, we cannot reuse the solution used for Cuda.
>>>>>>>>>>>>>>>> All it means is that you can't just use the solution used for CUDA
>>>>>>>>>>>>>>>> "off the shelf".  The basic idea of associating diagnostics with the
>>>>>>>>>>>>>>>> current function and then emitting those diagnostics later when you
>>>>>>>>>>>>>>>> realize that you have to emit that function is still completely
>>>>>>>>>>>>>>>> applicable.
>>>>>>>>>>>>>>>> John.
>>>>>>>>>>>>>> -- 
>>>>>>>>>>>>>> Hal Finkel
>>>>>>>>>>>>>> Lead, Compiler Technology and Programming Languages
>>>>>>>>>>>>>> Leadership Computing Facility
>>>>>>>>>>>>>> Argonne National Laboratory
Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory
