[cfe-dev] Difference in generated code between variadic parameter pack and manual version

Mon Sep 21 04:01:48 PDT 2020

Hi there folks,

I wonder if anybody can shed some light on this. I'm comparing two functions
that should do exactly the same thing: one takes its inputs as a parameter
pack, the other is a manual version that spells the inputs out as separate
parameters.

https://godbolt.org/z/Keqzcj
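
For anyone who can't open the link, the two functions have roughly the
following shape. This is a sketch reconstructed from the description in this
message, not the exact code behind the link: the names or_variadic, or_manual
and versions_agree, the uint64_t element type, and the choice of three inputs
are all assumptions made for illustration.

```cpp
#include <cassert>   // for the assert-based check in the usage note below
#include <cstddef>
#include <cstdint>
#include <vector>

// Variadic version (hypothetical name): ORs all input arrays element-wise.
// The pack expansion yields one __restrict__-qualified pointer per input.
template <typename... T>
void or_variadic(std::size_t n, std::uint64_t* __restrict__ out,
                 const T* __restrict__... input) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = (input[i] | ...);  // C++17 unary fold over the pack
    }
}

// Manual version (hypothetical name): the same computation with the three
// inputs written out as separate parameters.
void or_manual(std::size_t n, std::uint64_t* __restrict__ out,
               const std::uint64_t* __restrict__ a,
               const std::uint64_t* __restrict__ b,
               const std::uint64_t* __restrict__ c) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] | b[i] | c[i];
    }
}

// Helper (hypothetical): runs both versions on the same data and reports
// whether they produced identical output.
bool versions_agree(std::size_t n) {
    std::vector<std::uint64_t> a(n), b(n), c(n), out_v(n), out_m(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = i * 3;
        c[i] = i << 8;
    }
    or_variadic(n, out_v.data(), a.data(), b.data(), c.data());
    or_manual(n, out_m.data(), a.data(), b.data(), c.data());
    return out_v == out_m;
}
```

Calling assert(versions_agree(1000)); only confirms that the two produce the
same results; the question here is about the code generated for them, which
needs -O3 -march=broadwell and a look at the assembly.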

However, the version with the parameter pack compiles (at -O3
-march=broadwell, on clang 10.0.1, on godbolt) into a loop that handles 128
bytes per iteration, plus a loop that handles 64 bytes per iteration, plus
non-vectorized instructions to process the remaining <=63 bytes. The manual
version compiles to just a loop that handles 128 bytes per iteration (256-bit
vectors, unrolled 4x), plus non-vectorized instructions to process the
remaining <=127 bytes.

It's not about the fold expression. I replaced the body of the inner loop of
the first function with:

auto tuple = std::make_tuple(input[i]...);
out[i] = std::get<0>(tuple) | std::get<1>(tuple) | std::get<2>(tuple);

And it generates the same code as the fold-expression version, AFAICT.
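
In full, the tuple variant would look something like the sketch below. As
before, this is an illustration under assumptions rather than the exact code
from the link: the names or_via_tuple and tuple_version_matches_reference, the
uint64_t element type, and the restriction to exactly three inputs are mine.

```cpp
#include <cassert>   // for the assert-based check in the usage note below
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Tuple variant (hypothetical name): still takes a parameter pack, but the
// loop body goes through std::make_tuple / std::get instead of a fold.
template <typename... T>
void or_via_tuple(std::size_t n, std::uint64_t* __restrict__ out,
                  const T* __restrict__... input) {
    static_assert(sizeof...(T) == 3, "this sketch hard-codes three inputs");
    for (std::size_t i = 0; i < n; ++i) {
        auto tuple = std::make_tuple(input[i]...);
        out[i] = std::get<0>(tuple) | std::get<1>(tuple) | std::get<2>(tuple);
    }
}

// Helper (hypothetical): compares the tuple variant against a plain loop.
bool tuple_version_matches_reference(std::size_t n) {
    std::vector<std::uint64_t> a(n), b(n), c(n), out(n), expected(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = i;
        b[i] = i * 5;
        c[i] = i << 16;
        expected[i] = a[i] | b[i] | c[i];
    }
    or_via_tuple(n, out.data(), a.data(), b.data(), c.data());
    return out == expected;
}
```

A check such as assert(tuple_version_matches_reference(1000)); confirms the
results are the same; it says nothing about the generated loops, which is the
part I'm asking about.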

It may be about how __restrict__ is handled when the parameter pack is
expanded. But I don't see how __restrict__ could lead to *these* differences.

FWIW, my benchmarks seem to indicate that the variadic version is about 50%
slower, and I have no idea why. The instruction order in the inner loop is
different; could that be what makes the difference?

Any clues would be appreciated!

-- 
Bart Samwel
bart.samwel at databricks.com