<div dir="ltr"><div dir="ltr"><div dir="ltr"><div class="gmail_default" style="font-family:verdana,sans-serif">Hi, Bryce,</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif">On Fri, Oct 26, 2018 at 6:04 PM Bryce Lelbach via cfe-dev <<a href="mailto:cfe-dev@lists.llvm.org">cfe-dev@lists.llvm.org</a>> wrote:</span><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div dir="ltr">
<div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">
Today, CUDA C++ has a macro that can be used to distinguish which architecture<br>
(either the host architecture, a specific device architecture, or any device<br>
architecture) code is currently being compiled for.<br>
<br>
When CUDA code is compiled for the host, __CUDA_ARCH__ is not defined. When it<br>
is compiled for the device, it is defined to a value that indicates the SM architecture.<br>
<br>
At face value, this seems like a useful way to customize how heterogeneous code<br>
is implemented on a particular architecture:<br>
<br>
__host__ __device__<br>
uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {<br>
#if __CUDA_ARCH__ >= 200<br>
asm ("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(x) : "r"(x), "r"(y), "r"(z));<br>
#else<br>
x = x + y + z;<br>
#endif<br>
return x;<br>
}<br>
<br>
However, __CUDA_ARCH__ is only well suited to a split compilation CUDA compiler,<br>
like NVCC, which uses a separate host compiler (GCC, Clang, MSVC, etc) and device<br>
compiler, preprocessing and compiling your code once for each target architecture<br>
(once for the host, and once for each target device architecture).<br>
<br>
__CUDA_ARCH__ has some caveats, however. The NVCC compiler has to see all kernel<br>
function declarations (e.g. __global__ functions) during both host and device<br>
compilation, to generate the host side launch stubs and the actual device side<br>
kernel code. Otherwise, NVCC may not compile the device side kernel code, either<br>
because it believes it is unused or because it is never instantiated (in the case<br>
of a template kernel function). This, regrettably, will not fail at compile time,<br>
but instead fails at runtime when you attempt to launch the (non-existent) kernel.<br>
<br>
Consider the following code. It unconditionally calls `parallel::reduce_n_impl`<br>
on the host, which instantiates some (unseen) template kernel functions during<br>
host compilation. However, in device code, if THRUST_HAS_CUDART is false, <br>
`parallel::reduce_n_impl` is never instantiated and the actual device code for<br>
the kernel functions is never compiled.<br>
<br></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"> #if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__ >= 350 && defined(__CUDACC_RDC__))<br>
// We're either not compiling as device code, or we are compiling as device<br>
// code and we can launch kernels from device code (SM 3.5 and higher + <br>
// relocatable device code is required for the device side runtime which is<br>
// needed to do device side launches). <br>
# define THRUST_HAS_CUDART 1<br>
#else<br>
# define THRUST_HAS_CUDART 0<br>
#endif<br>
<br>
namespace thrust {<br>
<br>
#pragma nv_exec_check_disable<br>
template <typename Derived,<br>
typename InputIt,<br>
typename Size,<br>
typename T,<br>
typename BinaryOp><br>
__host__ __device__<br>
T reduce_n(execution_policy<Derived>& policy,<br>
InputIt first,<br>
Size num_items,<br>
T init,<br>
BinaryOp binary_op)<br>
{<br>
// Broken version:<br>
#if THRUST_HAS_CUDART<br>
return system::cuda::reduce_n_impl(policy, first, num_items, init, binary_op);<br>
#else<br>
// We are running on the device and there is no device side runtime, so we<br>
// can't launch a kernel to do the reduction in parallel. Instead, we just<br>
// do a sequential reduction in the calling thread.<br>
return system::sequential::reduce_n_impl(first, num_items, init, binary_op);<br>
#endif<br>
}<br>
<br>
} // namespace thrust<br>
<br></div></div></blockquote><div><div class="gmail_default" style="font-family:verdana,sans-serif"></div><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">CUDA programming guide says "don't do it":</div></div><div><a href="https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-arch-macro">https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-arch-macro</a><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">"<span style="color:rgb(0,0,0);font-family:"Trebuchet MS","DIN Pro",sans-serif;font-size:14px">If a </span><samp class="gmail-ph gmail-codeph" style="font-family:Consolas,Courier,"Courier New",monospace;color:rgb(34,68,0);background:rgb(244,247,240);padding:0px 0.2em;margin-bottom:1em;font-size:14px;border:0px">__global__</samp><span style="color:rgb(0,0,0);font-family:"Trebuchet MS","DIN Pro",sans-serif;font-size:14px"> function template is instantiated and launched from the host, then the function template must be instantiated with the same template arguments irrespective of whether </span><samp class="gmail-ph gmail-codeph" style="font-family:Consolas,Courier,"Courier New",monospace;color:rgb(34,68,0);background:rgb(244,247,240);padding:0px 0.2em;margin-bottom:1em;font-size:14px;border:0px">__CUDA_ARCH__</samp><span style="color:rgb(0,0,0);font-family:"Trebuchet MS","DIN Pro",sans-serif;font-size:14px"> is defined and regardless of the value of </span><samp class="gmail-ph gmail-codeph" style="font-family:Consolas,Courier,"Courier New",monospace;color:rgb(34,68,0);background:rgb(244,247,240);padding:0px 0.2em;margin-bottom:1em;font-size:14px;border:0px">__CUDA_ARCH__</samp><span style="color:rgb(0,0,0);font-family:"Trebuchet MS","DIN Pro",sans-serif;font-size:14px">.</span>"</div></div><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">So, yes, you shouldn't do it with nvcc. 
</div></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">
Instead, we end up using the rather odd pattern of adding a (non-constexpr) if<br>
statement whose condition is known at compile time. This ensures the kernel function<br>
is instantiated during device compilation, even though it is not actually used.<br>
Fortunately, while NVCC can as-if optimize away the if statement, it cannot treat<br>
the instantiation as unused.<br>
<br>
#pragma nv_exec_check_disable<br>
template <typename Derived,<br>
typename InputIt,<br>
typename Size,<br>
typename T,<br>
typename BinaryOp><br>
__host__ __device__<br>
T reduce_n(execution_policy<Derived>& policy,<br>
InputIt first,<br>
Size num_items,<br>
T init,<br>
BinaryOp binary_op)<br>
{<br>
if (THRUST_HAS_CUDART)<br>
return parallel::reduce_n_impl(policy, first, num_items, init, binary_op);<br>
<br>
#if !THRUST_HAS_CUDART<br>
// We are running on the device and there is no device side runtime, so we<br>
// can't launch a kernel to do the reduction in parallel. Instead, we just<br>
// do a sequential reduction in the calling thread.<br>
return sequential::reduce_n_impl(first, num_items, init, binary_op);<br>
#endif<br>
}<br>
<br>
For more background, see:<br>
<br>
<a href="https://github.com/NVlabs/cub/issues/30" target="_blank">https://github.com/NVlabs/cub/issues/30</a><br>
<a href="https://stackoverflow.com/questions/51248770/cuda-arch-flag-with-thrust-execution-policy" target="_blank">https://stackoverflow.com/questions/51248770/cuda-arch-flag-with-thrust-execution-policy</a><br>
<br></div></div></blockquote><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">For a merged parse CUDA compiler, like Clang CUDA, __CUDA_ARCH__ is a poor fit, <br>
because as a textual macro it can be used to completely change the code that <br>
the compiler consumes during host and device compilation, essentially forcing<br>
separate preprocessing and parsing.<br></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">Yes. One of the ideas I had was to abandon it altogether, parse the source once and then codegen host and device IR from the same AST. Alas, the goal was to compile the existing CUDA code, so we had to keep __CUDA_ARCH__ working, and because of that we also have to parse the sources once for each host and device compilation.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">If __CUDA_ARCH__ were gone, it would make it possible to do more neat things.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">
<br>
Clang CUDA offers one alternative today, __host__ / __device__ overloading,<br>
which is better suited to a merged parse model:<br>
<br>
__device__<br>
uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {<br>
asm ("vadd.u32.u32.u32.add %0, %1, %2, %3;" : "=r"(x) : "r"(x), "r"(y), "r"(z));<br>
return x;<br>
}<br>
<br>
__host__ <br>
uint32_t iadd3(uint32_t x, uint32_t y, uint32_t z) {<br>
return x + y + z;<br>
}<br>
<br>
However, this approach does not allow us to customize code for specific device<br>
architectures. Note that the above code will not compile on SM 1.0 devices, as<br>
the inline assembly contains instructions unavailable on those platforms.<br></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">Correct. But now that we have the ability for the source for two different targets (host & device) to co-exist, in principle it should be relatively easy to extend it to (host + N devices). </div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">
<br>
Tuning for specific device architectures is critical for high performance CUDA<br>
libraries, like Thrust. We need to be able to select different algorithms and<br>
use architecture specific facilities to get speed of light performance.<br>
<br>
Fortunately, there is some useful prior art. Clang (and GCC) has a related feature,<br>
__attribute__((target("..."))), which can be used to define a function "overloaded"<br>
on the architecture it is compiled for. One common use case for this feature is<br>
implementing functions that utilize micro-architecture specific CPU SIMD<br>
instructions:<br>
<br>
using double4 = double __attribute__((__vector_size__(32)));<br>
<br>
__attribute__((target("sse")))<br>
double4 fma(double4 x, double4 y, double4 z);<br>
<br>
__attribute__((target("avx")))<br>
double4 fma(double4 x, double4 y, double4 z);<br>
<br>
__attribute__((target("default")))<br>
double4 fma(double4 x, double4 y, double4 z); // "Fallback" implementation.<br>
<br>
This attribute can also be used to target specific architectures:<br>
<br>
__attribute__((target("arch=atom")))<br>
void foo(); // Will be called on 'atom' processors.<br>
<br>
__attribute__((target("default")))<br>
void foo(); // Will be called on any other processors.<br>
<br>
This could easily be extended for heterogeneous compilation:<br>
<br>
__attribute__((target("host:arch=skylake")))<br>
void foo();<br>
<br>
__attribute__((target("arch=atom")))<br>
void foo(); // Implicitly "host:arch=atom".<br>
<br>
__attribute__((target("host:default")))<br>
void foo();<br>
<br>
__attribute__((target("device:arch=sm_20")))<br>
void foo();<br>
<br>
__attribute__((target("device:arch=sm_60")))<br>
void foo();<br>
<br>
__attribute__((target("device:default")))<br>
void foo();<br>
<br>
Or, perhaps more concisely, we could introduce this shorthand:<br>
<br>
__host__("arch=skylake")<br>
void foo();<br>
<br>
__host__<br>
void foo(); // Implicitly "host:default".<br>
<br>
__device__("arch=sm_20")<br>
void foo();<br>
<br>
__device__("arch=sm_60")<br>
void foo();<br>
<br>
__device__ // Implicitly "device:default".<br>
void foo();<br>
<br></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">This could be one way to do it, though __attribute__((target)) has existing semantics that do not quite match the use case you're proposing. I.e. currently the intent is to generate the code for a different variant of the architecture we're compiling for, and it's possible to incorporate the generated code for all the variants in the TU into the same object file. In PTX, we can't do the same. I.e. in your example above, we can't incorporate sm_60 PTX/SASS into the same object file with sm_20.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">If I understand your proposal correctly, instead of generating code for multiple different targets, you want to skip the targets you can't generate code for during a particular compilation. I.e. it still implies one compilation pass per device. If that's the case, then it does not change the current overloading scheme much, as any given compilation will only deal with host + 1 device, and the code targeting incompatible devices would be ignored.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">So, extending the __device__ attribute and specializing it for a particular target seems plausible, but it probably should not be based on __attribute__((target)).</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr">
Another place where we use __CUDA_ARCH__ today in Thrust and CUB is in<br>
metaprogramming code that selects the correct "strategies" that should be<br>
used to implement a particular algorithm:<br>
<br>
enum arch {
<div> host,<br>
sm_30, sm_32, sm_35, // Kepler<br>
sm_50, sm_52, sm_53, // Maxwell<br>
sm_60, sm_61, sm_62, // Pascal<br>
sm_70, // Volta<br>
sm_72, sm_75 // Turing<br>
};<br>
<div><br>
</div>
<div> __host__ __device__<br>
constexpr arch select_arch()<br>
{<br>
switch (__CUDA_ARCH__)<br>
{<br>
// ...<br>
}; <br>
}<br>
<br>
template <class T, arch Arch = select_arch()><br>
struct radix_sort_tuning;<br>
<br>
template <class T><br>
struct radix_sort_tuning<T, sm_35><br>
{<br>
static constexpr size_t INPUT_SIZE = sizeof(T);<br>
<br>
static constexpr size_t NOMINAL_4B_ITEMS_PER_THREAD = 11;<br>
static constexpr size_t ITEMS_PER_THREAD<br>
= std::min(NOMINAL_4B_ITEMS_PER_THREAD,<br>
std::max(size_t(1), (NOMINAL_4B_ITEMS_PER_THREAD * 4 / INPUT_SIZE)));<br>
<br>
static constexpr size_t BLOCK_THREADS = 256;<br>
static constexpr auto BLOCK_LOAD_STRATEGY = BLOCK_LOAD_WARP_TRANSPOSE;<br>
static constexpr auto CACHE_LOAD_STRATEGY = LOAD_LDG;<br>
static constexpr auto BLOCK_STORE_STRATEGY = BLOCK_STORE_WARP_TRANSPOSE;<br>
};<br>
<br>
template <typename T><br>
struct radix_sort_tuning<T, sm_50> { /* ... */ };<br>
<br>
// ...<br>
<br>
With heterogeneous target attributes, we could implement select_arch like<br>
so:<br>
<br>
<span style="font-family:Consolas,Courier,monospace,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols;font-size:16px"> __host__</span><br style="font-family:Consolas,Courier,monospace,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols;font-size:16px">
<span style="font-family:Consolas,Courier,monospace,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols;font-size:16px"> constexpr arch select_arch() { return host; }</span><br>
<br>
__device__("arch=sm_30")<br>
constexpr arch select_arch() { return sm_30; }<br>
<br>
__device__("arch=sm_35")<br>
constexpr arch select_arch() { return sm_35; }<br>
<br>
// ...<br>
<br>
You could also potentially use this with if constexpr:
<div><br>
</div>
<div> void foo()</div>
<div> {</div>
<div> // Moral equivalent of #if __CUDA_ARCH__</div>
<div> if constexpr (host != select_arch())</div>
<div> // ...</div>
<div> else</div>
<div> // ...</div>
<div> }<br>
<br></div></div></div></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">We could probably make select_arch() a constexpr builtin function returning the current compilation target.</div></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"><div><div><div>
This feature would also make it much easier to port some of the more tricky parts<br>
of libc++ to GPUs, such as iostreams and concurrency primitives.<br>
<br>
It would be awesome if we could take __host__ / __device__ overloading a step<br>
further and make it a full fledged replacement for __CUDA_ARCH__. It would provide<br>
a possible future migration path away from __CUDA_ARCH__, which would enable us to<br>
move to true merged parsing for heterogeneous C++: preprocess once, parse once,<br>
perform platform-agnostic optimizations once, code gen multiple times.<br></div></div></div></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">+100.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"><div><div><div>
<br>
So, questions:<br>
<div><br>
</div>
<div><span style="font-family:Consolas,Courier,monospace,EmojiFont,&quot;Apple Color Emoji&quot;,&quot;Segoe UI Emoji&quot;,NotoColorEmoji,&quot;Segoe UI Symbol&quot;,&quot;Android Emoji&quot;,EmojiSymbols;font-size:16px">- Can target attributes go on constexpr functions today?</span></div></div></div></div></div></div></blockquote><div><span class="gmail_default" style="font-family:verdana,sans-serif">Godbolt says yes, but the function no longer behaves as a constexpr</span> <span class="gmail_default" style="font-family:verdana,sans-serif">as it must be dispatched via ifunc: <a href="https://godbolt.org/z/64I_SJ">https://godbolt.org/z/64I_SJ</a></span></div><div><span class="gmail_default" style="font-family:verdana,sans-serif">In any case, I don't think target attributes are a good match for this. </span> <span class="gmail_default" style="font-family:verdana,sans-serif">We should probably extend 'device' instead. </span><br></div><div><span class="gmail_default" style="font-family:verdana,sans-serif"></span></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"><div><div><div>
- Does anyone have suggestions for how this approach could be improved? Alternatives?</div></div></div></div></div></blockquote><div><div class="gmail_default" style="font-family:verdana,sans-serif">Perhaps we could consider using SFINAE to specialize device-side functions.</div><div class="gmail_default" style="font-family:verdana,sans-serif">E.g. something along the lines of</div><div class="gmail_default" style="font-family:verdana,sans-serif">enable_if_t<current_arch()==350, void> foo(){sm_35 code;}</div><div class="gmail_default" style="font-family:verdana,sans-serif">enable_if_t<current_arch()==600, void> foo(){sm_60 code;}</div><div class="gmail_default" style="font-family:verdana,sans-serif"><span style="font-family:Arial,Helvetica,sans-serif"> </span></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"><div><div>
<div>- Is there interest in this in Clang CUDA?</div></div></div></div></div></blockquote><div><br></div><div><div class="gmail_default" style="font-family:verdana,sans-serif">This is a very good question. </div><div class="gmail_default" style="font-family:verdana,sans-serif">My (anecdotal) observation is that most of the GPU cycles these days are spent in code written by NVIDIA itself. Everybody seems to be writing wrappers over cuDNN/cuBLAS/cuFFT, so interest in writing home-grown CUDA code has diminished a lot -- what's the point if NVIDIA's libraries are faster and beating them is unlikely?</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">If we had a GPU back-end capable of generating native GPU code that achieves peak performance, that could help resurrect interest in writing CUDA code again.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">Ironically, it may be AMD that makes this happen. There's been a lot of progress lately in getting clang to compile HIP (~CUDA) for AMD GPUs and making it work with TensorFlow.</div><div class="gmail_default" style="font-family:verdana,sans-serif"><br></div><div class="gmail_default" style="font-family:verdana,sans-serif">--Artem<br></div></div><div><br></div><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr"><div id="gmail-m_-790310077706743992divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Consolas,Courier,monospace" dir="ltr"><div><div>
<div></div>
<div><br>
------------------------------------------------------<br>
Bryce Adelstein Lelbach aka wash<br>
ISO C++ LEWGI Chair<br>
CppCon and C++Now Program Chair<br>
Thrust Maintainer, HPX Developer<br>
CUDA Convert and Reformed AVX Junkie<br>
<br>
Ask "Dumb" Questions<br>
------------------------------------------------------<br>
</div>
</div>
</div>
</div>
</div>
_______________________________________________<br>
cfe-dev mailing list<br>
<a href="mailto:cfe-dev@lists.llvm.org" target="_blank">cfe-dev@lists.llvm.org</a><br>
<a href="http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev" rel="noreferrer" target="_blank">http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev</a><br>
</blockquote></div><br clear="all"><div><br></div>-- <br><div dir="ltr" class="gmail_signature"><div dir="ltr">--Artem Belevich</div></div></div></div></div>