[PATCH] D30886: [ELF] Pad x86 executable sections with 0xcc int3 instructions

Thu Mar 16 15:27:46 PDT 2017

ruiu added a comment.

I applied this patch and still see the difference. My machine is a 2-socket Xeon E5-2680 v2 @ 2.80GHz. I used `perf stat -r 5 <linker command line>` to run the linker five times for each test condition.

With this patch:

     32169.548068 task-clock (msec)         #    3.716 CPUs utilized            ( +-  2.78% )
          179,956 context-switches          #    0.006 M/sec                    ( +-  0.24% )
           11,756 cpu-migrations            #    0.365 K/sec                    ( +-  6.92% )
        2,569,253 page-faults               #    0.080 M/sec                    ( +-  0.99% )
   82,125,867,840 cycles                    #    2.553 GHz                      ( +-  2.96% )
   61,896,938,065 stalled-cycles-frontend   #   75.37% frontend cycles idle     ( +-  3.66% )
  <not supported> stalled-cycles-backend  
   52,720,252,362 instructions              #    0.64  insns per cycle        
                                            #    1.17  stalled cycles per insn  ( +-  0.34% )
   10,152,771,263 branches                  #  315.602 M/sec                    ( +-  0.45% )
      162,810,720 branch-misses             #    1.60% of all branches          ( +-  0.19% )

      8.658035887 seconds time elapsed                                          ( +-  1.06% )

Without this patch:

     32446.779894 task-clock (msec)         #    3.929 CPUs utilized            ( +-  1.66% )
          181,730 context-switches          #    0.006 M/sec                    ( +-  0.44% )
           11,550 cpu-migrations            #    0.356 K/sec                    ( +-  7.07% )
        2,550,740 page-faults               #    0.079 M/sec                    ( +-  0.79% )
   82,878,006,345 cycles                    #    2.554 GHz                      ( +-  2.15% )
   62,603,702,917 stalled-cycles-frontend   #   75.54% frontend cycles idle     ( +-  2.79% )
  <not supported> stalled-cycles-backend
   52,760,009,322 instructions              #    0.64  insns per cycle
                                            #    1.19  stalled cycles per insn  ( +-  0.21% )
   10,165,389,935 branches                  #  313.294 M/sec                    ( +-  0.24% )
      163,535,769 branch-misses             #    1.61% of all branches          ( +-  0.29% )

      8.259118872 seconds time elapsed                                          ( +-  1.04% )

In this patch, one thread writes to a large memory region to initialize it with a fixed pattern, and immediately afterwards other threads read from and write to that same region. I think that memory access pattern can be quite expensive on a multi-processor machine, but I don't know why it doesn't happen on your machine.
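To make the access pattern concrete, here is a minimal sketch of the two orderings. It is not the code from the patch. `fillThenWrite`, `fillPerSlice`, `forEachSlice` and `writeSection` are hypothetical names, and "section contents" are simulated by writing a payload byte into the first half of each slice. The first function has one thread touch the whole buffer before the workers start. The second lets each worker fill only the slice it is about to write. Both produce the same bytes; the sketch makes no claim about which is faster on a given machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

constexpr uint8_t kTrap = 0xcc; // x86 int3 opcode used as padding

// Hypothetical stand-in for copying a section: the first half of the slice
// gets "contents", the second half is left as padding.
static void writeSection(uint8_t *slice, size_t len, uint8_t payload) {
  std::memset(slice, payload, len / 2);
}

// Split buf into numThreads contiguous slices and run fn(ptr, len) on each
// slice in its own thread.
template <typename Fn>
static void forEachSlice(std::vector<uint8_t> &buf, unsigned numThreads, Fn fn) {
  assert(numThreads > 0);
  size_t chunk = (buf.size() + numThreads - 1) / numThreads;
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < numThreads; ++i) {
    size_t begin = std::min(buf.size(), i * chunk);
    size_t end = std::min(buf.size(), begin + chunk);
    workers.emplace_back(
        [&buf, begin, end, fn] { fn(buf.data() + begin, end - begin); });
  }
  for (auto &t : workers)
    t.join();
}

// Ordering described above: a single thread initializes the whole region,
// then the worker threads immediately write into it.
void fillThenWrite(std::vector<uint8_t> &buf, unsigned numThreads,
                   uint8_t payload) {
  std::memset(buf.data(), kTrap, buf.size());
  forEachSlice(buf, numThreads, [payload](uint8_t *p, size_t n) {
    writeSection(p, n, payload);
  });
}

// Alternative ordering (an assumption, not something the patch does): each
// worker fills its own slice with the pattern before writing into it.
void fillPerSlice(std::vector<uint8_t> &buf, unsigned numThreads,
                  uint8_t payload) {
  forEachSlice(buf, numThreads, [payload](uint8_t *p, size_t n) {
    std::memset(p, kTrap, n);
    writeSection(p, n, payload);
  });
}
```

Timing the two functions on the same buffer size and thread count (for example under `perf stat -r 5`) would be one way to check whether the single-threaded fill is what costs the extra time.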

https://reviews.llvm.org/D30886