T4 Performance Counters explained

Posted by user13346607 on Oracle Blogs
Published on Mon, 26 Mar 2012 03:44:15 -0500 Indexed on 2012/03/26 11:38 UTC

Now that the T4 has been out for a few months, some people might have wondered which details of the new pipeline can be monitored. Running "cpustat -h" lists a lot of events that can be monitored, and only very few of them are self-explanatory. I will try to give some insight into all of them. Some of these "PIC events" require in-depth knowledge of the T4 pipeline; over time I will try to explain these, but for the time being those events should simply be ignored. (Side note: some counters changed from tape-out 1.1, which was *only* used in the T4 beta program, to tape-out 1.2, which is used in the systems shipping today. The table lists only the tape-out 1.2 counters.)
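
To make the step from an event name to an actual measurement concrete, here is a minimal Python sketch that assembles a cpustat command line. It assumes the usual "-c pic0=event,pic1=event" event specification followed by an interval and a count, as described in the cpustat man page; the helper name cpustat_command and its defaults are hypothetical, and the number of PICs that can be programmed at once should be checked on your own system.

```python
def cpustat_command(events, interval=1, count=10, attrs=()):
    """Build a cpustat argument list that programs one event per PIC.

    events:   counter names as printed by `cpustat -h`
    attrs:    optional attribute settings such as "sys" (assumed syntax)
    """
    if not events:
        raise ValueError("need at least one event name")
    # Assign the events to pic0, pic1, ... in the order given.
    spec = [f"pic{i}={name}" for i, name in enumerate(events)]
    spec.extend(attrs)
    return ["cpustat", "-c", ",".join(spec), str(interval), str(count)]


cmd = cpustat_command(["Instr_all", "Cycles_user"])
print(" ".join(cmd))
```

The sketch only builds the argument list; running it requires a Solaris system with the appropriate privileges.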

pic name (cpustat)	Prose Comment
Sel-pipe-drain-cycles, Sel-0-[wait|ready], Sel-[1,2]	Sel-0-wait counts cycles a strand waits to be selected. Some reasons can be counted in detail. Sel-0-ready: cycles a strand was ready but not selected, which can signal pipeline oversubscription. Sel-1: cycles only one instruction or µop was selected. Sel-2: cycles two instructions or µops were selected. Sel-pipe-drain-cycles: cf. PRM footnote 8 to table 10.2
Pick-any, Pick-[0|1|2|3]	Pick-any: cycles at least one instruction or µop is picked. Pick-[0|1|2|3]: cycles no, one, two, or three instructions or µops are picked
Instr_FGU_crypto	Number of FGU or crypto instructions executed on that vcpu
Instr_ld	Same, for loads
Instr_st	Same, for stores
SPR_ring_ops	Same, for SPR ring ops
Instr_other	Same, for all other instructions not listed above; PRM footnote 7 to table 10.2 lists these instructions
Instr_all	total number of instructions executed on that vcpu
Sw_count_intr	Number of S/W count instructions on that vcpu (sethi %hi(fc000),%g0 (whatever that is))
Atomics	Number of atomic ops, which are LDSTUB/A, CASA/XA, and SWAP/A
SW_prefetch	Number of PREFETCH or PREFETCHA instructions
Block_ld_st	Block loads or stores on that vcpu
IC_miss_nospec, IC_miss_[L2_or_L3|local|remote]_hit_nospec	Various I$ misses, distinguished by where they hit. All of these count per thread, but only primary events: T4 counts only the first occurrence of an I$ miss on a core for a certain instruction. If one strand misses in I$, this miss is counted, but if a second strand on the same core misses while the first miss is being resolved, that second miss is not counted. This flavour of I$ misses counts only misses caused by instructions that really commit (note the "_nospec")
BTC_miss	Branch target cache miss
ITLB_miss	ITLB misses (synchronously counted)
ITLB_miss_asynch	Same, but counted asynchronously
[I|D]TLB_fill_[8KB|64KB|4MB|256MB|2GB|trap]	H/W tablewalk events that fill ITLB or DTLB with translation for the corresponding page size. The “_trap” event occurs if the HWTW was not able to fill the corresponding TLB
IC_mtag_miss, IC_mtag_miss_[ptag_hit|ptag_miss|ptag_hit_way_mismatch]	I$ micro tag misses, with some options for drill down
Fetch-0, Fetch-0-all	Fetch-0 counts the number of cycles nothing was fetched for this particular strand; Fetch-0-all counts cycles nothing was fetched for any strand on a core
Instr_buffer_full	Cycles the instruction buffer for a strand was full, thereby preventing any fetch
BTC_targ_incorrect	Counts all occurrences of wrongly predicted branch targets from the BTC
[PQ|ROB|LB|ROB_LB|SB|ROB_SB|LB_SB|RB_LB_SB|DTLB_miss]_tag_wait	ST_q_tag_wait is listed under sl=20. These counters monitor pipeline behaviour and are therefore not strand specific. PQ_...: cycles the Rename stage waits for a Pick Queue tag (might signal a memory-bound workload in single-thread mode, cf. mail from Richard Smith). ROB_...: cycles the Select stage waits for a ROB (ReOrderBuffer) tag. LB_...: cycles the Select stage waits for a Load Buffer tag. SB_...: cycles the Select stage waits for a Store Buffer tag. Combinations of the above are allowed; although some of these events can overlap, the counter is incremented only once per cycle if any of them occur. DTLB_...: cycles load or store instructions wait at the Pick stage for a DTLB miss tag
[ID]TLB_HWTW_[L2_hit|L3_hit|L3_miss|all]	Counters for HWTW accesses caused by either DTLB or ITLB misses. Can be further detailed by where they hit
IC_miss_L2_L3_hit, IC_miss_local_remote_remL3_hit, IC_miss	I$ prefetches that were dropped because they either miss in L2$ or L3$. This variant counts misses regardless of whether the causing instruction commits or not
DC_miss_nospec, DC_miss_[L2_L3|local|remote_L3]_hit_nospec	D$ misses, either in general or detailed by where they hit; cf. the IC_miss entries for an explanation of _nospec and the reasoning for two flavours of DC_miss counters
DTLB_miss_asynch	Counts all DTLB misses asynchronously; there is no way to count them synchronously
DC_pref_drop_DC_hit, SW_pref_drop_[DC_hit|buffer_full]	L1-D$ h/w prefetches that were dropped because of a D$ hit, counted per core. The others count software prefetches per strand
[Full|Partial]_RAW_hit_st_[buf|q]	Count events where a load wants to get data that has not yet been stored, i.e. it is still inside the pipeline. The data might be either still in the store buffer or in the store queue. If the load's data matches in the SB and in the store queue, the data in the buffer takes precedence, of course, since it is younger
[IC|DC]_evict_invalid, [IC|DC|L1]_snoop_invalid, [IC|DC|L1]_invalid_all	Counters for invalidated cache evictions per core
St_q_tag_wait	Number of cycles the pipeline waits for a store queue tag, of course counted per core
Data_pref_[drop_L2|drop_L3|hit_L2|hit_L3|hit_local|hit_remote]	Data prefetches that can be further detailed by either why they were dropped or where they hit
St_hit_[L2|L3], St_L2_[local|remote]_C2C, St_local, St_remote	Store events distinguished by where they hit or where they cause an L2 cache-to-cache transfer, i.e. either a transfer from another L2$ on the same die or from a different die
DC_miss, DC_miss_[L2_L3|local|remote]_hit	D$ misses, either in general or detailed by where they hit; cf. the IC_miss entries for an explanation of _nospec and the reasoning for two flavours of DC_miss counters
L2_[clean\|dirty]_evict	Per core clean or dirty L2$ evictions
L2_fill_buf_full, L2_wb_buf_full, L2_miss_buf_full	Per core L2$ buffer events, all count number of cycles that this state was present
L2_pipe_stall	Per core cycles pipeline stalled because of L2$
Branches	Count branches (Tcc, DONE, RETRY, and SIT are not counted as branches)
Br_taken	Counts taken branches (Tcc, DONE, RETRY, and SIT are not counted as branches)
Br_mispred, Br_dir_mispred, Br_trg_mispred, Br_trg_mispred_[far_tbl|indir_tbl|ret_stk]	Counters for various branch misprediction events
Cycles_user	Counts cycles; the attribute settings hpriv, nouser, and sys control which address space is counted
Commit-[0|1|2], Commit-0-all, Commit-1-or-2	Number of times either no, one, or two µops commit for a strand. Commit-0-all counts the number of times no µop commits for the whole core; cf. footnote 11 to table 10.2 in the PRM for a more detailed explanation of how these counters interact with the privilege levels
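
Raw counts become easier to read once they are combined into ratios. The following Python sketch shows one way to do that for a few counters from the table. The function name derived_metrics and the sample numbers are made up for illustration, and the ratios are only meaningful if all counters were collected over the same interval and with the same attribute settings.

```python
def derived_metrics(counts):
    """Compute a few ratios from a dict of raw counter totals."""

    def ratio(num, den):
        # Return None instead of dividing by zero when the denominator
        # counter is missing or zero.
        d = counts.get(den, 0)
        return counts.get(num, 0) / d if d else None

    mix_events = ["Instr_ld", "Instr_st", "Instr_FGU_crypto",
                  "SPR_ring_ops", "Instr_other"]
    return {
        # Instructions per cycle for this vcpu.
        "ipc": ratio("Instr_all", "Cycles_user"),
        # Share of branches that were mispredicted.
        "branch_mispredict_rate": ratio("Br_mispred", "Branches"),
        # Share of branches that were taken.
        "taken_fraction": ratio("Br_taken", "Branches"),
        # Instruction mix, only for the classes that were measured.
        "mix": {e: ratio(e, "Instr_all") for e in mix_events if e in counts},
    }


# Made-up totals, for illustration only.
sample = {
    "Instr_all": 1000, "Cycles_user": 2000,
    "Instr_ld": 250, "Instr_st": 100,
    "Branches": 200, "Br_mispred": 10, "Br_taken": 120,
}
metrics = derived_metrics(sample)
print(metrics)
```

Because only a limited number of events can be counted at the same time, the totals for such a dict would typically come from several runs of the same workload, which adds some noise to the ratios.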
