T4 Performance Counters explained

Posted by user13346607 on Oracle Blogs
Published on Mon, 26 Mar 2012 03:44:15 -0500 Indexed on 2012/03/26 11:38 UTC

Now that the T4 has been out for a few months, some people might have wondered which details of the new pipeline can be monitored. A "cpustat -h" lists a lot of events that can be monitored, and only very few are self-explanatory. I will try to give some insight into all of them. Some of these "PIC events" require in-depth knowledge of the T4 pipeline; I will try to explain these over time, and for the time being they should simply be ignored. (Side note: some counters changed from tape-out 1.1 (*only* used in the T4 beta program) to tape-out 1.2 (used in the systems shipping today). The table only lists the tape-out 1.2 counters.)

pic name (cpustat)

Prose Comment

Sel-pipe-drain-cycles,
Sel-0-[wait|ready],
Sel-[1,2]

Sel-0-wait counts cycles a strand waits to be selected. Some reasons can be counted in detail; these are:

  • Sel-0-ready: Cycles a strand was ready but not selected; this can signal pipeline oversubscription
  • Sel-1: Cycles only one instruction or µop was selected
  • Sel-2: Cycles two instructions or µops were selected
  • Sel-pipe-drain-cycles: cf. PRM footnote 8 to table 10.2

Pick-any, Pick-[0|1|2|3]

Cycles in which at least one (Pick-any), no (Pick-0), one (Pick-1), two (Pick-2), or three (Pick-3) instructions or µops are picked

Instr_FGU_crypto

Number of FGU or crypto instructions executed on that vcpu

Instr_ld

Number of load instructions executed on that vcpu

Instr_st

Number of store instructions executed on that vcpu

SPR_ring_ops

Number of SPR ring ops executed on that vcpu

Instr_other

Number of all other instructions not listed above executed on that vcpu. PRM footnote 7 to table 10.2 lists the instructions.

Instr_all

total number of instructions executed on that vcpu
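
The Instr_* counters above can be combined into a rough instruction mix. The following is a minimal sketch, assuming the per-class counts and Instr_all were sampled over the same interval on the same vcpu; the numbers and the helper name `instruction_mix` are invented for illustration and do not come from real hardware.

```python
# Hypothetical per-vcpu counts, invented for illustration only
# (not measured on real hardware).
counts = {
    "Instr_FGU_crypto": 50_000,
    "Instr_ld": 250_000,
    "Instr_st": 100_000,
    "SPR_ring_ops": 1_000,
    "Instr_other": 599_000,
}


def instruction_mix(counts, total):
    """Return each class's share of `total` as a fraction between 0 and 1."""
    if total <= 0:
        raise ValueError("total instruction count must be positive")
    return {name: value / total for name, value in counts.items()}


# Assumption: Instr_all covers the same interval as the per-class counts.
instr_all = 1_000_000
mix = instruction_mix(counts, instr_all)
for name, share in mix.items():
    print(f"{name:18s} {share:6.1%}")
```

If the shares do not add up to roughly 100%, the counters were probably not sampled over the same interval.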

Sw_count_intr

Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0 (whatever that is))

Atomics

Number of atomic ops, which are LDSTUB/A, CASA/XA, and SWAP/A

SW_prefetch

Number of PREFETCH or PREFETCHA instructions

Block_ld_st

Block loads or stores on that vcpu

IC_miss_nospec,
IC_miss_[L2_or_L3|local|remote]\
_hit_nospec

Various I$ misses, distinguished by where they hit. All of these count per thread, but only primary events: T4 counts only the first occurrence of an I$ miss on a core for a certain instruction. If one strand misses in I$, this miss is counted, but if a second strand on the same core misses while the first miss is being resolved, that second miss is not counted.
This flavour of I$ misses counts only misses that are caused by instructions that really commit (note the "_nospec").

BTC_miss

Branch target cache miss

ITLB_miss

ITLB misses (synchronously counted)

ITLB_miss_asynch

ITLB misses, but counted asynchronously

[I|D]TLB_fill_\
[8KB|64KB|4MB|256MB|2GB|trap]

H/W tablewalk events that fill ITLB or DTLB with translation for the corresponding page size. The “_trap” event occurs if the HWTW was not able to fill the corresponding TLB
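
The table uses a shorthand in which each "[a|b]" group stands for one alternative and a trailing backslash continues the name on the next line. The following is a minimal sketch of how that shorthand could be expanded into individual counter names, assuming every combination of alternatives is a valid name; the helper `expand_counter_names` is hypothetical and not part of cpustat.

```python
import itertools
import re


def expand_counter_names(pattern):
    """Expand '[a|b]' alternations in a counter pattern into full names."""
    # Join lines that were split with a trailing backslash.
    joined = re.sub(r"\\\s*", "", pattern)
    # Split into literal text and bracketed groups, keeping the groups.
    parts = re.split(r"(\[[^\]]*\])", joined)
    choices = []
    for part in parts:
        if part.startswith("[") and part.endswith("]"):
            choices.append(part[1:-1].split("|"))
        else:
            choices.append([part])
    return ["".join(combo) for combo in itertools.product(*choices)]


names = expand_counter_names("[I|D]TLB_fill_\\\n[8KB|64KB|4MB|256MB|2GB|trap]")
print(len(names))
print(names[0], names[-1])
```

The names actually available on a given system are the ones "cpustat -h" prints, so the expanded list is best treated as a reading aid.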

IC_mtag_miss,
IC_mtag_miss_\
[ptag_hit|ptag_miss|\
ptag_hit_way_mismatch]

I$ micro tag misses, with some options for drill down

Fetch-0, Fetch-0-all

Fetch-0 counts the number of cycles nothing was fetched for this particular strand; Fetch-0-all counts cycles nothing was fetched for all strands on a core

Instr_buffer_full

Cycles the instruction buffer for a strand was full, thereby preventing any fetch

BTC_targ_incorrect

Counts all occurrences of wrongly predicted branch targets from the BTC

[PQ|ROB|LB|ROB_LB|SB|\
ROB_SB|LB_SB|ROB_LB_SB|\
DTLB_miss]\
_tag_wait

St_q_tag_wait is listed under sl=20.

These counters monitor pipeline behaviour; therefore they are not strand specific:

  • PQ_...: cycles Rename stage waits for a Pick Queue tag (might signal a memory-bound workload in single-thread mode, cf. mail from Richard Smith)
  • ROB_...: cycles Select stage waits for a ROB (ReOrderBuffer) tag
  • LB_...: cycles Select stage waits for a Load Buffer tag
  • SB_...: cycles Select stage waits for Store Buffer tag
  • combinations of the above are allowed. Although some of these events can overlap, a combined counter is incremented only once per cycle if any of them occur
  • DTLB_...: cycles load or store instructions wait at Pick stage for a DTLB miss tag
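
Because a combined counter is incremented only once per cycle, its value can be smaller than the sum of the individual counters. The following is a minimal sketch of that counting rule on an invented per-cycle trace; the trace format and the helper `count_cycles` are assumptions made for illustration, not output of any real tool.

```python
# Hypothetical per-cycle trace: for each cycle, the set of tags the
# Select stage was waiting on. Invented data, not real hardware output.
trace = [
    {"ROB"},
    {"ROB", "LB"},   # both waits overlap in this cycle
    set(),
    {"LB"},
    {"ROB", "LB"},
    set(),
]


def count_cycles(trace, tags):
    """Count cycles in which at least one of `tags` was being waited on."""
    wanted = set(tags)
    return sum(1 for waiting in trace if waiting & wanted)


rob = count_cycles(trace, ["ROB"])            # models ROB_tag_wait
lb = count_cycles(trace, ["LB"])              # models LB_tag_wait
rob_lb = count_cycles(trace, ["ROB", "LB"])   # models ROB_LB_tag_wait
print(rob, lb, rob_lb)
```

In this trace the two single counters add up to 6, while the combined counter reports 4, because the two overlapping cycles are counted once each.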

[ID]TLB_HWTW_\
[L2_hit|L3_hit|L3_miss|all]

Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit

IC_miss_L2_L3_hit,
IC_miss_local_remote_remL3_hit,
IC_miss

I$ prefetches that were dropped because they miss in either L2$ or L3$.
This variant counts misses regardless of whether the causing instruction commits or not.
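
Since one flavour counts misses regardless of commit and the "_nospec" flavour counts only misses of committing instructions, the difference gives a rough idea of how many misses were caused by instructions that never committed. The following is a minimal sketch, assuming both counters were sampled over the same interval; the numbers and the helper `speculative_miss_share` are invented for illustration.

```python
def speculative_miss_share(miss_all, miss_nospec):
    """Fraction of misses attributed to instructions that did not commit."""
    if miss_all <= 0:
        return 0.0
    if miss_nospec > miss_all:
        raise ValueError("nospec count exceeds total count")
    return (miss_all - miss_nospec) / miss_all


# Hypothetical values for IC_miss and IC_miss_nospec (not measured).
ic_miss = 4_000
ic_miss_nospec = 3_000
share = speculative_miss_share(ic_miss, ic_miss_nospec)
print(f"{share:.1%}")
```

The same arithmetic would apply to DC_miss and DC_miss_nospec under the same assumption.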

DC_miss_nospec, DC_miss_[L2_L3|local|remote_L3]\
_hit_nospec

D$ misses either in general or detailed by where they hit
See the explanation of the two IC_miss flavours for the meaning of _nospec and the reasoning behind having two DC_miss counters

DTLB_miss_asynch

Counts all DTLB misses asynchronously; there is no way to count them synchronously

DC_pref_drop_DC_hit, SW_pref_drop_[DC_hit|buffer_full]

L1-D$ h/w prefetches that were dropped because of a D$ hit, counted per core. The others count software prefetches per strand

[Full|Partial]_RAW_hit_st_[buf|q]

Count events where a load wants to get data that has not yet been stored, i.e. it is still inside the pipeline. The data might still be in either the store buffer or the store queue. If the load's data matches in both the store buffer (SB) and the store queue, the data in the buffer takes precedence, since it is younger

[IC|DC]_evict_invalid,
[IC|DC|L1]_snoop_invalid,
[IC|DC|L1]_invalid_all

Counter for invalidated cache evictions per core

St_q_tag_wait

Number of cycles pipeline waits for a store queue tag, of course counted per core

Data_pref_[drop_L2|drop_L3|\
hit_L2|hit_L3|\
hit_local|hit_remote]

Data prefetches that can be further detailed by either why they were dropped or where they hit

St_hit_[L2|L3],
St_L2_[local|remote]_C2C,
St_local, St_remote

Store events distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. either a transfer from another L2$ on the same die or from a different die

DC_miss, DC_miss_\
[L2_L3|local|remote]_hit

D$ misses either in general or detailed by where they hit
See the explanation of the two IC_miss flavours for the meaning of _nospec and the reasoning behind having two DC_miss counters

L2_[clean|dirty]_evict

Per core clean or dirty L2$ evictions

L2_fill_buf_full,
L2_wb_buf_full,
L2_miss_buf_full

Per core L2$ buffer events; all count the number of cycles that this state was present

L2_pipe_stall

Per core cycles pipeline stalled because of L2$

Branches

Count branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_taken

Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)

Br_mispred,
Br_dir_mispred,
Br_trg_mispred,
Br_trg_mispred_\
[far_tbl|indir_tbl|ret_stk]

Counter for various branch misprediction events. 
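
The branch counters are most useful as ratios. The following is a minimal sketch, assuming Branches, Br_taken, and Br_mispred were sampled over the same interval and that Br_mispred can be related directly to Branches; the numbers and the helper `branch_rates` are invented for illustration.

```python
def branch_rates(branches, br_taken, br_mispred):
    """Return taken and mispredict ratios relative to all counted branches."""
    if branches <= 0:
        raise ValueError("branch count must be positive")
    return {
        "taken_ratio": br_taken / branches,
        "mispredict_ratio": br_mispred / branches,
    }


# Hypothetical counter values (not measured on real hardware).
rates = branch_rates(branches=200_000, br_taken=120_000, br_mispred=5_000)
print(f"taken: {rates['taken_ratio']:.1%}, "
      f"mispredicted: {rates['mispredict_ratio']:.1%}")
```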

Cycles_user

Counts cycles; the attribute settings hpriv, nouser, and sys control which address space is counted

Commit-[0|1|2],
Commit-0-all,
Commit-1-or-2

Number of times either no, one, or two µops commit for a strand. Commit-0-all counts the number of times no µop commits for the whole core. Cf. footnote 11 to table 10.2 in the PRM for a more detailed explanation of how these counters interact with the privilege levels
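
The three per-strand commit counters can be folded into a committed-µop total, since each Commit-1 event stands for one µop and each Commit-2 event for two. The following is a minimal sketch, assuming Commit-0, Commit-1, and Commit-2 were sampled over the same interval and do not overlap; the numbers and the helper `commit_summary` are invented for illustration.

```python
def commit_summary(commit_0, commit_1, commit_2):
    """Summarise Commit-0/1/2 counts for one strand."""
    samples = commit_0 + commit_1 + commit_2
    if samples <= 0:
        raise ValueError("need at least one counted event")
    # One µop per Commit-1 event, two µops per Commit-2 event.
    uops = commit_1 + 2 * commit_2
    return {
        "uops_committed": uops,
        "uops_per_sample": uops / samples,
        "idle_share": commit_0 / samples,
    }


# Hypothetical counter values (not measured on real hardware).
summary = commit_summary(commit_0=600, commit_1=300, commit_2=100)
print(summary)
```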
