T4 Performance Counters explained
- by user13346607
Now that T4 has been out for a few months, some people might have wondered which details of the new pipeline can be monitored. A "cpustat -h" lists a lot of events that can be monitored, and only very few are self-explanatory. I will try to give some insight into all of them. Some of these "PIC events" require in-depth knowledge of the T4 pipeline; I will try to explain those over time, and for the time being they should simply be ignored. (Side note: some counters changed from tape-out 1.1, which was *only* used in the T4 beta program, to tape-out 1.2, which is used in the systems shipping today. The table lists only the tape-out 1.2 counters.)
pic name (cpustat)
Prose Comment
Sel-pipe-drain-cycles,
Sel-0-[wait|ready],
Sel-[1|2]
Sel-0-wait counts the cycles a strand waits to be selected. Some reasons can be counted in detail; these are:
Sel-0-ready: cycles a strand was ready but not selected; this can signal pipeline oversubscription
Sel-1: cycles in which only one instruction or µop was selected
Sel-2: cycles in which two instructions or µops were selected
Sel-pipe-drain-cycles: cf. PRM footnote 8 to table 10.2
Pick-any,
Pick-[0|1|2|3]
Cycles in which no, one, two, or three instructions or µops are picked (Pick-0 to Pick-3), or in which at least one is picked (Pick-any)
Instr_FGU_crypto
Number of FGU or crypto instructions executed on that vcpu
Instr_ld
Same, for load instructions
Instr_st
Same, for store instructions
SPR_ring_ops
Same, for SPR ring ops
Instr_other
Same, for all other instructions not listed above; PRM footnote 7 to table 10.2 lists these instructions
Instr_all
Total number of instructions executed on that vcpu
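As an illustration of how the Instr_* events above relate to each other, here is a minimal Python sketch that turns raw counts into an instruction mix. It assumes the counts were already collected (however that was done) and handed over as a plain dictionary; the function name `instruction_mix` and the sample numbers are hypothetical and not measured data.

```python
def instruction_mix(counts):
    """Return each instruction class as a fraction of Instr_all.

    `counts` maps cpustat event names to raw counts. Classes missing
    from the dictionary are treated as zero.
    """
    total = counts["Instr_all"]
    if total == 0:
        return {}
    classes = [
        "Instr_FGU_crypto",
        "Instr_ld",
        "Instr_st",
        "SPR_ring_ops",
        "Instr_other",
    ]
    return {name: counts.get(name, 0) / total for name in classes}


# Hypothetical sample values, for illustration only.
sample = {
    "Instr_all": 1000,
    "Instr_FGU_crypto": 50,
    "Instr_ld": 250,
    "Instr_st": 100,
    "SPR_ring_ops": 0,
    "Instr_other": 600,
}
mix = instruction_mix(sample)
for name, share in mix.items():
    print(f"{name:18s} {share:6.1%}")
```

Whether the five classes add up exactly to Instr_all on real hardware is not something this sketch checks; it only reports each class relative to the total.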
Sw_count_intr
Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0, whatever that is)
Atomics
Number of atomic ops, which are LDSTUB/A, CASA/XA, and SWAP/A
SW_prefetch
Number of PREFETCH or PREFETCHA instructions
Block_ld_st
Block loads or stores on that vcpu
IC_miss_nospec,
IC_miss_[L2_or_L3|local|remote]_hit_nospec
Various I$ misses, distinguished by where they hit. All of these count per thread, but only primary events: T4 counts only the first occurrence of an I$ miss on a core for a certain instruction. If one strand misses in I$, this miss is counted, but if a second strand on the same core misses while the first miss is being resolved, that second miss is not counted.
This flavour of I$ misses counts only misses that are caused by instructions that really commit (note the "_nospec").
BTC_miss
Branch target cache miss
ITLB_miss
ITLB misses (synchronously counted)
ITLB_miss_asynch
Same, but counted asynchronously
[I|D]TLB_fill_[8KB|64KB|4MB|256MB|2GB|trap]
H/W tablewalk events that fill the ITLB or DTLB with a translation for the corresponding page size. The "_trap" event occurs if the HWTW was not able to fill the corresponding TLB.
IC_mtag_miss,
IC_mtag_miss_[ptag_hit|ptag_miss|ptag_hit_way_mismatch]
I$ micro tag misses, with some options for drill-down
Fetch-0,
Fetch-0-all
Fetch-0 counts the number of cycles in which nothing was fetched for this particular strand; Fetch-0-all counts the cycles in which nothing was fetched for all strands on a core
Instr_buffer_full
Cycles the instruction buffer for a strand was full, thereby preventing any fetch
BTC_targ_incorrect
Counts all occurrences of wrongly predicted branch targets from the BTC
[PQ|ROB|LB|ROB_LB|SB|ROB_SB|LB_SB|RB_LB_SB|DTLB_miss]_tag_wait
ST_q_tag_wait is listed under sl=20.
These counters monitor pipeline behaviour and are therefore not strand-specific:
PQ_...: cycles the Rename stage waits for a Pick Queue tag (might signal a memory-bound workload in single-thread mode, cf. mail from Richard Smith)
ROB_...: cycles the Select stage waits for a ROB (ReOrderBuffer) tag
LB_...: cycles the Select stage waits for a Load Buffer tag
SB_...: cycles the Select stage waits for a Store Buffer tag
Combinations of the above are allowed. Although some of these events can overlap, the counter is incremented only once per cycle if any of them occurs.
DTLB_...: cycles load or store instructions wait at the Pick stage for a DTLB miss tag
[ID]TLB_HWTW_[L2_hit|L3_hit|L3_miss|all]
Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit.
IC_miss_L2_L3_hit,
IC_miss_local_remote_remL3_hit,
IC_miss
I$ prefetches that were dropped because they miss in either L2$ or L3$.
This variant counts misses regardless of whether the causing instruction commits or not.
DC_miss_nospec,
DC_miss_[L2_L3|local|remote_L3]_hit_nospec
D$ misses, either in general or detailed by where they hit.
See the explanation of the two IC_miss flavours for the meaning of _nospec and the reasoning behind the two DC_miss counters.
DTLB_miss_asynch
Counts all DTLB misses asynchronously; there is no way to count them synchronously
DC_pref_drop_DC_hit,
SW_pref_drop_[DC_hit|buffer_full]
L1-D$ h/w prefetches that were dropped because of a D$ hit, counted per core.
The others count software prefetches per strand.
[Full|Partial]_RAW_hit_st_[buf|q]
Count events where a load wants to get data that has not yet been stored, i.e. it is still inside the pipeline. The data might be either still in the store buffer or in the store queue. If the load's data matches in both the store buffer and the store queue, the data in the buffer takes precedence, of course, since it is younger.
[IC|DC]_evict_invalid,
[IC|DC|L1]_snoop_invalid,
[IC|DC|L1]_invalid_all
Counters for invalidated cache evictions, per core
St_q_tag_wait
Number of cycles the pipeline waits for a store queue tag; counted per core, of course
Data_pref_[drop_L2|drop_L3|hit_L2|hit_L3|hit_local|hit_remote]
Data prefetches, which can be further detailed by either why they were dropped or where they hit
St_hit_[L2|L3],
St_L2_[local|remote]_C2C,
St_local, St_remote
Store events, distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. either a transfer from another L2$ on the same die or from a different die
DC_miss,
DC_miss_[L2_L3|local|remote]_hit
D$ misses, either in general or detailed by where they hit.
See the explanation of the two IC_miss flavours for the meaning of _nospec and the reasoning behind the two DC_miss counters.
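To show how the "detailed by where they hit" variants can be read against the general DC_miss count, here is a small Python sketch. It assumes the counts are already available as a dictionary keyed by the expanded event names; the function name `dc_miss_breakdown` and the sample values are hypothetical, and the sketch makes no claim about what "local" and "remote" refer to beyond the names used in the table.

```python
def dc_miss_breakdown(counts):
    """Return the share of DC_miss attributed to each hit location."""
    total = counts["DC_miss"]
    if total == 0:
        return {}
    variants = [
        "DC_miss_L2_L3_hit",
        "DC_miss_local_hit",
        "DC_miss_remote_hit",
    ]
    return {name: counts.get(name, 0) / total for name in variants}


# Hypothetical sample values, for illustration only.
sample = {
    "DC_miss": 2000,
    "DC_miss_L2_L3_hit": 1800,
    "DC_miss_local_hit": 150,
    "DC_miss_remote_hit": 50,
}
breakdown = dc_miss_breakdown(sample)
for name, share in breakdown.items():
    print(f"{name:20s} {share:6.1%}")
```

The same shape would work for the _nospec flavour by swapping in the corresponding event names.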
L2_[clean|dirty]_evict
Per core clean or dirty L2$ evictions
L2_fill_buf_full,
L2_wb_buf_full,
L2_miss_buf_full
Per core L2$ buffer events; all of them count the number of cycles in which this state was present
L2_pipe_stall
Per core cycles the pipeline stalled because of L2$
Branches
Counts branches (Tcc, DONE, RETRY, and SIT are not counted as branches)
Br_taken
Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)
Br_mispred,
Br_dir_mispred,
Br_trg_mispred,
Br_trg_mispred_[far_tbl|indir_tbl|ret_stk]
Counters for various branch misprediction events.
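The branch events above lend themselves to a simple ratio. The following Python sketch computes a misprediction rate from Branches and Br_mispred, plus the share of direction and target mispredictions. The function name `branch_mispredict_stats` and the sample numbers are hypothetical; the sketch assumes the counts were gathered over the same interval and does not assume that the direction and target counts add up to Br_mispred.

```python
def branch_mispredict_stats(counts):
    """Derive misprediction ratios from raw branch event counts."""
    branches = counts["Branches"]
    mispred = counts["Br_mispred"]
    rate = mispred / branches if branches else 0.0
    # Shares are relative to all mispredictions, not to all branches.
    dir_share = counts.get("Br_dir_mispred", 0) / mispred if mispred else 0.0
    trg_share = counts.get("Br_trg_mispred", 0) / mispred if mispred else 0.0
    return {
        "mispredict_rate": rate,
        "direction_share": dir_share,
        "target_share": trg_share,
    }


# Hypothetical sample values, for illustration only.
sample = {
    "Branches": 200_000,
    "Br_mispred": 4_000,
    "Br_dir_mispred": 3_000,
    "Br_trg_mispred": 1_000,
}
stats = branch_mispredict_stats(sample)
print(f"mispredict rate: {stats['mispredict_rate']:.2%}")
print(f"direction share: {stats['direction_share']:.0%}")
print(f"target share:    {stats['target_share']:.0%}")
```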
Cycles_user
Counts cycles; the attribute settings hpriv, nouser, and sys control in which address space cycles are counted
Commit-[0|1|2],
Commit-0-all,
Commit-1-or-2
Number of times either no, one, or two µops commit for a strand. Commit-0-all counts the number of times no µop commits for the whole core; cf. footnote 11 to table 10.2 in the PRM for a more detailed explanation of how these counters interact with the privilege levels
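One way to summarise the Commit-[0|1|2] events is an average number of µops committed per counted event. The Python sketch below assumes that the three events are mutually exclusive and were counted over the same interval for the same strand, which the table suggests but does not state outright. The function name `commit_profile` and the sample values are hypothetical.

```python
def commit_profile(counts):
    """Summarise Commit-0/1/2 counts for one strand."""
    c0 = counts["Commit-0"]
    c1 = counts["Commit-1"]
    c2 = counts["Commit-2"]
    total = c0 + c1 + c2
    if total == 0:
        return {"avg_uops_per_event": 0.0, "idle_share": 0.0}
    return {
        # Weighted average: one µop for Commit-1, two for Commit-2.
        "avg_uops_per_event": (c1 + 2 * c2) / total,
        # Fraction of counted events in which nothing committed.
        "idle_share": c0 / total,
    }


# Hypothetical sample values, for illustration only.
sample = {"Commit-0": 600, "Commit-1": 300, "Commit-2": 100}
profile = commit_profile(sample)
print(f"average µops per event: {profile['avg_uops_per_event']:.2f}")
print(f"share with no commit:   {profile['idle_share']:.0%}")
```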