[PATCH] D10991: [LNT] Reduce I/O execution time for Polybench
Kristof Beyls
kristof.beyls at arm.com
Tue Jul 7 09:50:43 PDT 2015
================
Comment at: SingleSource/Benchmarks/Polybench/stencils/seidel-2d/seidel-2d.c:46
@@ +45,3 @@
+ for (j = 0; j < n; j++)
+ print_element(A[i][j], j*8, printmat);
+ fputs(printmat, stderr);
----------------
Do I understand correctly that this code basically only prints out the values of the last row of the entire matrix (the offset is j*8, so every row overwrites the same buffer positions)? I think we'd want whatever hash function implementation we end up with to still take all elements as input, to improve the chance of detecting a mis-compilation.
I think the hash function can be really simple - no need for anything complex or secure; but we probably should feed all matrix elements into the hash function. Maybe the straightforward solution here is to just print out the sum of all elements in a row, rather than each element in the row?
Tobias may know these tests better: are we expecting bit-reproducible results for these tests? I'm guessing so, unless DATA_PRINTF_MODIFIER in the original code was chosen so that it prints out with less precision?
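To make the row-sum suggestion concrete, here is a minimal sketch, not a proposed patch. The helper names row_sum and print_row_sums, the flat row-major float layout, the group parameter and the "%.2f" format are all assumptions made for illustration; the real kernels use the Polybench data type and DATA_PRINTF_MODIFIER. Every element still contributes to the output, but only one line is printed per row (or per group of rows).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: sum one row of n floats, accumulating in double. */
static double row_sum(const float *row, int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += row[j];
    return sum;
}

/* Hypothetical helper: print one checksum line for every `group` rows of
   a row-major n x n matrix. Returns the number of lines printed.
   The "%.2f" precision is an assumption, not the benchmark's modifier. */
static int print_row_sums(FILE *out, const float *a, int n, int group)
{
    assert(n > 0 && group > 0);
    int lines = 0;
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += row_sum(a + (size_t)i * n, n);
        /* Flush after each full group, and after the last row. */
        if ((i + 1) % group == 0 || i == n - 1) {
            fprintf(out, "%.2f\n", sum);
            sum = 0.0;
            lines++;
        }
    }
    return lines;
}
```

With group set to 1 this prints n lines instead of n*n values. Whether a plain sum is a strong enough check is a judgment call: it will not notice two elements being swapped within a row, for example.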
================
Comment at: SingleSource/Benchmarks/Polybench/utilities/polybench.h:609-630
@@ -608,3 +608,24 @@
-
+/* To avoid calling printf M*M times (and make it run
+ for a long time), we split the output into an encoded string,
+ and print it as a simple char pointer, M times.*/
+static inline
+void print_element(float el, int pos, char *out)
+{
+ union {
+ float datum;
+ char bytes[4];
+ } block;
+
+ block.datum = el;
+ /* each nibble as a char, within the printable range */
+ *(out+pos) = (block.bytes[0]&0xF0>>4)+'0';
+ *(out+pos+1) = (block.bytes[0]&0x0F) +'0';
+ *(out+pos+2) = (block.bytes[1]&0xF0>>4)+'0';
+ *(out+pos+3) = (block.bytes[1]&0x0F) +'0';
+ *(out+pos+4) = (block.bytes[2]&0xF0>>4)+'0';
+ *(out+pos+5) = (block.bytes[2]&0x0F) +'0';
+ *(out+pos+6) = (block.bytes[3]&0xF0>>4)+'0';
+ *(out+pos+7) = (block.bytes[3]&0x0F) +'0';
+}
----------------
I'm not sure, but it looks like this may give different answers on big-endian versus little-endian machines (a problem the previous implementation didn't have)?
Maybe just printing out the sum of each row of a matrix (i.e. 4000 floats being printed) instead of the entire matrix (4000*4000 floats being printed) already reduces the I/O overhead to be in the noise? If not, a couple of rows could be summed up together to reduce the amount of I/O further?
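If the nibble encoding were kept, the byte-order concern could probably be avoided by extracting nibbles from the float's 32-bit pattern with shifts instead of indexing bytes through a union. The sketch below is only an illustration under stated assumptions: encode_element is a hypothetical name, float is assumed to be a 32-bit IEEE-754 value, and floats and integers are assumed to share the same byte order on the target.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical endian-independent variant of print_element: writes the
   8 nibbles of el's bit pattern, most significant first, as printable
   characters starting at out[pos]. Does not write a terminating NUL. */
static void encode_element(float el, int pos, char *out)
{
    uint32_t bits;
    assert(sizeof bits == sizeof el);   /* assumes a 32-bit float */
    memcpy(&bits, &el, sizeof bits);    /* well-defined type pun */
    for (int k = 0; k < 8; k++) {
        /* Shift first, then mask. In C, >> binds tighter than &,
           so the parentheses around the shift matter. */
        unsigned nibble = (bits >> (28 - 4 * k)) & 0xFu;
        out[pos + k] = (char)('0' + nibble);
    }
}
```

Because the shifts operate on the integer value rather than on memory layout, the same float should produce the same eight characters regardless of the machine's byte order. For example, 1.0f has the bit pattern 0x3F800000 and encodes as "3?800000".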
Repository:
rL LLVM
http://reviews.llvm.org/D10991