T4 Performance Counters explained
Posted by user13346607 on Oracle Blogs
Published on Mon, 26 Mar 2012 03:44:15 -0500
Now that T4 has been out for a few months, some people might have wondered what details of the new pipeline can be monitored. A "cpustat -h" lists a lot of events that can be monitored, but only very few are self-explanatory. I will try to give some insight into all of them; some of these "PIC events" require an in-depth knowledge of the T4 pipeline. Over time I will try to explain those as well; for the time being, such events should simply be ignored. (Side note: some counters changed from tape-out 1.1 (used *only* in the T4 beta program) to tape-out 1.2 (used in the systems shipping today). The table only lists the tape-out 1.2 counters.)
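Before diving into the table, here is a minimal sketch of how these events are accessed with cpustat. The event names and the exact event-spec syntax should be verified against the "cpustat -h" output on your own T4 system:

```shell
# List the PIC events and attributes this CPU exposes
cpustat -h

# Sample one counter (total instructions per vcpu, see Instr_all
# in the table) on all vcpus, once per second:
cpustat -c pic0=Instr_all 1
```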
Each entry below gives the pic name (as reported by cpustat), followed by a prose comment:

Sel-pipe-drain-cycles, Sel-0-wait
    Counts cycles a strand waits to be selected. Some of the reasons can be counted in detail; these are:

Pick-any, Pick-[0|1|2|3]
    Cycles in which at least one (Pick-any) or exactly zero, one, two, or three instructions or µops are picked

Instr_FGU_crypto
    Number of FGU or crypto instructions executed on that vcpu

Instr_ld
    dto. for loads

Instr_st
    dto. for stores

SPR_ring_ops
    dto. for SPR ring ops

Instr_other
    dto. for all other instructions not listed above; PRM footnote 7 to table 10.2 lists the instructions

Instr_all
    Total number of instructions executed on that vcpu

Sw_count_intr
    Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0 (whatever that is))

Atomics
    Number of atomic ops, which are LDSTUB/A, CASA/XA, and SWAP/A

SW_prefetch
    Number of PREFETCH or PREFETCHA instructions

Block_ld_st
    Block loads or stores on that vcpu

IC_miss_nospec,
    Various I$ misses, distinguished by where they hit. All of these count per thread, but only primary events: T4 counts only the first occurrence of an I$ miss on a core for a certain instruction. If one strand misses in the I$, that miss is counted; but if a second strand on the same core misses while the first miss is being resolved, that second miss is not counted

BTC_miss
    Branch target cache misses

ITLB_miss
    ITLB misses (counted synchronously)

ITLB_miss_asynch
    dto., but counted asynchronously

[I|D]TLB_fill_\
    H/W tablewalk events that fill the ITLB or DTLB with the translation for the corresponding page size. The "_trap" event occurs if the HWTW was not able to fill the corresponding TLB

IC_mtag_miss,
    I$ micro-tag misses, with some options for drill-down

Fetch-0, Fetch-0-all
    Fetch-0 counts the number of cycles nothing was fetched for this particular strand; Fetch-0-all counts cycles in which nothing was fetched for all strands on a core

Instr_buffer_full
    Cycles the instruction buffer for a strand was full, thereby preventing any fetch

BTC_targ_incorrect
    Counts all occurrences of wrongly predicted branch targets from the BTC

[PQ|ROB|LB|ROB_LB|SB|\
    ST_q_tag_wait is listed under sl=20. These counters monitor pipeline behaviour; therefore they are not strand-specific:

[ID]TLB_HWTW_\
    Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit

IC_miss_L2_L3_hit,
    I$ prefetches that were dropped because they either miss in L2$ or L3$

DC_miss_nospec, DC_miss_[L2_L3|local|remote_L3]\
    D$ misses, either in general or detailed by where they hit

DTLB_miss_asynch
    Counts all DTLB misses asynchronously; there is no way to count them synchronously

DC_pref_drop_DC_hit, SW_pref_drop_[DC_hit|buffer_full]
    L1-D$ h/w prefetches that were dropped because of a D$ hit, counted per core. The others count software prefetches per strand

[Full|Partial]_RAW_hit_st_[buf|q]
    Count events where a load wants data that has not yet been stored, i.e. it is still inside the pipeline. The data might be either still in the store buffer or in the store queue. If the load's data matches in both the store buffer and the store queue, the data in the buffer takes precedence, of course, since it is younger

[IC|DC]_evict_invalid,
    Counters for invalidated cache evictions per core

St_q_tag_wait
    Number of cycles the pipeline waits for a store queue tag, of course counted per core

Data_pref_[drop_L2|drop_L3|\
    Data prefetches that can be further detailed by either why they were dropped or where they hit

St_hit_[L2|L3],
    Store events distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. either a transfer from another L2$ on the same die or from a different die

DC_miss, DC_miss_\
    D$ misses, either in general or detailed by where they hit

L2_[clean|dirty]_evict
    Per-core clean or dirty L2$ evictions

L2_fill_buf_full,
    Per-core L2$ buffer events; all count the number of cycles this state was present

L2_pipe_stall
    Per-core cycles the pipeline stalled because of the L2$

Branches
    Counts branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_taken
    Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_mispred,
    Counters for various branch misprediction events

Cycles_user
    Counts cycles; the attribute settings hpriv, nouser, and sys control the address space to count in

Commit-[0|1|2],
    Number of times either no, one, or two µops commit for a strand. Commit-0-all counts the number of times no µop commits on the whole core; cf. footnote 11 to table 10.2 in the PRM for a more detailed explanation of how these counters interact with the privilege levels
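To tie the table back to practice, two of the per-vcpu counters can be sampled together on the two PICs. This is a sketch assuming the usual Solaris event-spec syntax (pic0=...,pic1=...); check the event names against "cpustat -h" first, and note that cputrack(1) takes the same event spec but follows one process instead of whole CPUs (the pid below is a placeholder):

```shell
# Loads on pic0, stores on pic1; one-second samples, ten samples:
cpustat -c pic0=Instr_ld,pic1=Instr_st 1 10

# Same counters, but attributed to a single process:
cputrack -c pic0=DC_miss,pic1=Instr_all -p <pid>
```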
© Oracle Blogs or respective owner