[cfe-dev] Difference in generated code between variadic parameter pack and manual version
Bart Samwel via cfe-dev
cfe-dev at lists.llvm.org
Mon Sep 21 04:01:48 PDT 2020
Hi there folks,
I wonder if anybody can shed some light on this. I'm comparing two functions
that should do exactly the same thing: one takes its inputs as a parameter
pack, the other spells the same inputs out by hand.
https://godbolt.org/z/Keqzcj
However, the version with the parameter pack compiles (at -O3
-march=broadwell, on clang 10.0.1, on godbolt) into a loop that handles 128
bytes per iteration, plus a loop that handles 64 bytes per iteration, plus
nonvectorized instructions to process the remaining <=63 bytes. The manual
version compiles to just a loop that handles 128 bytes per iteration (256-bit
vectors, unrolled 4x), and nonvectorized instructions to process the
remaining <=127 bytes.
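For readers who cannot open the godbolt link, here is a minimal sketch of the kind of function pair being compared. It is not the code behind the link: the names or_variadic, or_manual and check_versions_agree, the std::uint64_t element type, and the choice of exactly three inputs are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical variadic version: ORs all inputs element-wise via a fold
// expression. Every pack element is declared __restrict__.
template <typename... Ts>
void or_variadic(std::size_t n, std::uint64_t* __restrict__ out,
                 const Ts* __restrict__... input) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (input[i] | ...);
  }
}

// Hypothetical manual version: the same computation with three inputs
// written out by hand.
inline void or_manual(std::size_t n, std::uint64_t* __restrict__ out,
                      const std::uint64_t* __restrict__ a,
                      const std::uint64_t* __restrict__ b,
                      const std::uint64_t* __restrict__ c) {
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = a[i] | b[i] | c[i];
  }
}

// Sanity check that both versions compute the same result on a tiny input.
inline void check_versions_agree() {
  const std::uint64_t a[4] = {1, 2, 4, 8};
  const std::uint64_t b[4] = {16, 32, 64, 128};
  const std::uint64_t c[4] = {256, 512, 1024, 2048};
  std::uint64_t v[4] = {};
  std::uint64_t m[4] = {};
  or_variadic(4, v, a, b, c);
  or_manual(4, m, a, b, c);
  for (std::size_t i = 0; i < 4; ++i) {
    assert(v[i] == m[i]);
    assert(v[i] == (a[i] | b[i] | c[i]));
  }
}
```

Compiling both functions with -O3 -march=broadwell and comparing the emitted loops side by side is one way to check whether the loop structure described above reproduces on a given compiler version.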
The fold expression does not seem to be the cause. I replaced the inner loop
of the first function with:
auto tuple = std::make_tuple(input[i]...);
out[i] = std::get<0>(tuple) | std::get<1>(tuple) | std::get<2>(tuple);
and it generates the same code AFAICT.
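To show the two-line replacement in context, here is a sketch of a complete function built around it. As before, this is an assumed reconstruction rather than the code from the link: the name or_tuple, the std::uint64_t element type, and the static_assert on three inputs are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <tuple>

// Hypothetical tuple-based variant: the pack is expanded into a tuple and
// the elements are ORed explicitly, so no fold expression is involved.
template <typename... Ts>
void or_tuple(std::size_t n, std::uint64_t* __restrict__ out,
              const Ts* __restrict__... input) {
  static_assert(sizeof...(Ts) == 3, "this sketch assumes exactly three inputs");
  for (std::size_t i = 0; i < n; ++i) {
    auto tuple = std::make_tuple(input[i]...);
    out[i] = std::get<0>(tuple) | std::get<1>(tuple) | std::get<2>(tuple);
  }
}

// Sanity check on a tiny input.
inline void check_or_tuple() {
  const std::uint64_t a[2] = {1, 2};
  const std::uint64_t b[2] = {4, 8};
  const std::uint64_t c[2] = {16, 32};
  std::uint64_t out[2] = {};
  or_tuple(2, out, a, b, c);
  assert(out[0] == (a[0] | b[0] | c[0]));
  assert(out[1] == (a[1] | b[1] | c[1]));
}
```

If this variant produces the same loop structure as the fold-expression version, that points at the parameter pack itself rather than at the fold.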
It may be about __restrict__ expansion for parameter pack arguments. But I
don't see how __restrict__ could lead to *these* differences.
FWIW, my benchmarks seem to indicate that the variadic version is about 50%
slower. I have no idea why. The instruction order in the inner loop is
different, which may make a difference?
Any clues would be appreciated!
--
Bart Samwel
bart.samwel at databricks.com