[LLVMdev] Dynamic Profiling - Instrumentation basic query

Mon Jan 14 22:11:10 PST 2013

Hi Silky,

On 14/01/13 01:47, Silky Arora wrote:
> I need to profile the code for branches (branch mis predicts
> simulation), load/store instructions (for cache hits/miss rate), and a
> couple of other things and therefore, would need to instrument the code.
> However, I would like to know if writing the output to a file would
> increase the execution time, or is it the profiling itself? I can
> probably use a data structure to store the output instead.
>
> Also, I have heard of Intel's Pin tool which can provide memory trace
> information. Could you please explain to me what you meant by hardware
> counters for dcache miss/hit rates.

I've also heard of Pin, but never actually used it.

Regarding the hardware counters: x86 processors count various hardware 
events via internal counters.  I think both Intel and AMD processors can 
do this, but I've only tried out Intel.  The easiest way to access these 
on Linux is probably via the 'perf' tool [1].  (There are other options 
on other platforms.  I think 'Intel VTune' can use these counters as well.)

[1] https://perf.wiki.kernel.org/

The result of running 'perf stat' on an arbitrary command (xz -9e
dictionary) is in the attached file (because my mail client was
destroying the formatting).  I just chose some counters which seemed to
match what you mention; there are many more, and 'perf list' will show
them.  The only issue I can think of is that the hardware counters
aren't available inside (most?) virtual machines.
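
The percentages in the attached output are just ratios of pairs of
counters, so they are easy to recompute or extend yourself.  A minimal
sketch, assuming the counter values are copied by hand from the
attachment (the dictionary keys are the event names from the command
line, and the helper name 'percent' is made up for illustration):

```python
# Counter values copied from the attached 'perf stat' output.
counters = {
    "cycles": 2_838_843_997,
    "instructions": 3_017_892_661,
    "cache-references": 28_281_385,
    "cache-misses": 6_820_873,
    "branch-instructions": 403_480_157,
    "branch-misses": 34_978_751,
    "L1-dcache-loads": 1_028_322_850,
    "L1-dcache-load-misses": 30_888_348,
    "L1-dcache-stores": 278_389_483,
    "L1-dcache-store-misses": 17_185_362,
    "dTLB-loads": 1_023_191_908,
    "dTLB-load-misses": 1_585_411,
}


def percent(part, whole):
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Instructions retired per cycle.
ipc = counters["instructions"] / counters["cycles"]

# Miss rates: each one is (misses / corresponding accesses).
rates = {
    "cache": percent(counters["cache-misses"], counters["cache-references"]),
    "branch": percent(counters["branch-misses"], counters["branch-instructions"]),
    "L1-dcache load": percent(counters["L1-dcache-load-misses"],
                              counters["L1-dcache-loads"]),
    "L1-dcache store": percent(counters["L1-dcache-store-misses"],
                               counters["L1-dcache-stores"]),
    "dTLB load": percent(counters["dTLB-load-misses"], counters["dTLB-loads"]),
}

print(f"insns per cycle: {ipc:.2f}")
for name, value in rates.items():
    print(f"{name} miss rate: {value:.2f}%")
```

The store miss rate is the one figure here that perf did not print
itself.  The bracketed percentages in the attachment appear to be the
fraction of the run during which each counter was actually active, so
all of these numbers are best read as estimates.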

If you need hit/miss rates, mispredict ratios etc. for each individual
load/store/branch, then I'm not sure these counters are very useful:
used as in the attached run, they only report totals for the whole
program.
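
To make the difference between a whole-run total and per-instruction
figures concrete, here is a rough sketch.  It assumes you already have
a stream of (site, missed) events from somewhere, e.g. your own
instrumentation; how to produce that stream is not shown, and the site
names and event data below are invented purely for illustration.

```python
from collections import defaultdict


def per_site_miss_rate(events):
    """Return {site: miss fraction} from an iterable of (site, missed) pairs."""
    totals = defaultdict(int)   # number of events seen per site
    misses = defaultdict(int)   # number of those events that missed
    for site, missed in events:
        totals[site] += 1
        if missed:
            misses[site] += 1
    return {site: misses[site] / totals[site] for site in totals}


def aggregate_miss_rate(events):
    """Return the single whole-run miss fraction, as a counter pair would give."""
    events = list(events)
    return sum(1 for _, missed in events if missed) / len(events)


# Hypothetical event stream: a well-predicted loop branch and a poorly
# predicted 'if'.  These values are made up, not measured.
events = (
    [("loop_back_edge", False)] * 9
    + [("loop_back_edge", True)]
    + [("if_rare", True), ("if_rare", True), ("if_rare", False), ("if_rare", True)]
)

per_site = per_site_miss_rate(events)
overall = aggregate_miss_rate(events)

for site, rate in per_site.items():
    print(f"{site}: {rate:.0%} missed")
print(f"whole run: {overall:.1%} missed")
```

The whole-run figure hides the fact that one site accounts for most of
the misses.  Keeping the tallies in an in-memory table like this and
printing once at the end is one way to avoid writing to a file on every
event.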

Regards,
Alastair.
-------------- next part --------------
/usr/sbin/perf stat -e cycles -e instructions -e cache-references -e cache-misses -e branch-instructions -e branch-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e dTLB-loads -e dTLB-load-misses xz -9e dictionary

 Performance counter stats for 'xz -9e dictionary':

     2,838,843,997 cycles                    #    0.000 GHz                     [24.96%]
     3,017,892,661 instructions              #    1.06  insns per cycle         [33.31%]
        28,281,385 cache-references                                             [33.29%]
         6,820,873 cache-misses              #   24.118 % of all cache refs     [33.31%]
       403,480,157 branches                                                     [16.70%]
        34,978,751 branch-misses             #    8.67% of all branches         [16.71%]
     1,028,322,850 L1-dcache-loads                                              [16.73%]
        30,888,348 L1-dcache-misses          #    3.00% of all L1-dcache hits   [16.70%]
       278,389,483 L1-dcache-stores                                             [16.66%]
        17,185,362 L1-dcache-misses                                             [16.68%]
     1,023,191,908 dTLB-loads                                                   [16.71%]
         1,585,411 dTLB-misses               #    0.15% of all dTLB cache hits  [16.67%]

       2.892917184 seconds time elapsed