High Resolution Timeouts
- by user12607257
The default resolution of application timers and timeouts is now 1
msec in Solaris 11.1, down from 10 msec in previous releases. This
improves out-of-the-box performance of polling and event-based
applications, such as ticker applications, and even the Oracle RDBMS
log writer. More on that in a moment.
As a simple example, the poll() system call takes a timeout argument
in units of msec:
System Calls                                            poll(2)

NAME
     poll - input/output multiplexing

SYNOPSIS
     int poll(struct pollfd fds[], nfds_t nfds, int timeout);

In Solaris 11, a call to poll(NULL,0,1) returns in 10 msec, because
even though a 1 msec interval is requested, the implementation rounds
to the system clock resolution of 10 msec. In Solaris 11.1, this call
returns in 1 msec.
In specification-lawyer terms, the resolution of CLOCK_REALTIME,
introduced by the POSIX.1b real-time extensions, is now 1 msec.
The function clock_getres(CLOCK_REALTIME,&res) reports a resolution of
1 msec, and any library calls whose man pages explicitly mention
CLOCK_REALTIME, such as nanosleep(), are subject to the new resolution.
Additionally, many legacy functions that pre-date POSIX.1b and do not
explicitly mention a clock domain, such as poll(), are subject to the
new resolution. Here is a fairly comprehensive list:
nanosleep
pthread_mutex_timedlock pthread_mutex_reltimedlock_np
pthread_rwlock_timedrdlock pthread_rwlock_reltimedrdlock_np
pthread_rwlock_timedwrlock pthread_rwlock_reltimedwrlock_np
mq_timedreceive mq_reltimedreceive_np
mq_timedsend mq_reltimedsend_np
sem_timedwait sem_reltimedwait_np
poll select pselect
_lwp_cond_timedwait _lwp_cond_reltimedwait
semtimedop sigtimedwait
aiowait aio_waitn aio_suspend
port_get port_getn
cond_timedwait cond_reltimedwait
setitimer (ITIMER_REAL)
miscellaneous RPC and LDAP calls
This change in resolution was made feasible because we made the
implementation of timeouts more efficient a few years back when we
re-architected the callout subsystem of Solaris. Previously,
timeouts were tested and expired by the kernel's clock thread which
ran 100 times per second, yielding a resolution of 10 msec. This
did not scale, as timeouts could be posted by every CPU, but were
expired by only a single thread. The resolution could be changed by
setting hires_tick=1 in /etc/system, but this caused the clock thread to
run at 1000 Hz, which made the potential scalability problem worse.
Given enough CPUs posting enough timeouts, the clock thread could be
a performance bottleneck. We fixed that by re-implementing the
timeout as a per-CPU timer interrupt (using the cyclic subsystem, for
those familiar with Solaris internals). This decoupled the clock
thread frequency from timeout resolution, and allowed us to improve
default timeout resolution without adding CPU overhead in the clock
thread.
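
To make the arithmetic concrete, here is a deliberately simplified model of the old behavior, in which a timeout can only expire on a clock tick. It is not the kernel's code: the function names are made up, it assumes the tick rate divides 1000 evenly, and it ignores where within a tick the request arrives.

```c
/* Simplified model: tick period in msec for a clock thread
 * running hz times per second (assumes hz divides 1000). */
int tick_period_ms(int hz)
{
    return 1000 / hz;
}

/* Rounds a requested timeout up to the next multiple of the
 * tick period, which is the soonest a tick-driven timeout
 * could expire in this model. */
int effective_timeout_ms(int requested_ms, int hz)
{
    int tick = tick_period_ms(hz);

    return ((requested_ms + tick - 1) / tick) * tick;
}
```

In this model, effective_timeout_ms(1, 100) is 10, matching the old default, and effective_timeout_ms(1, 1000) is 1, matching what hires_tick=1 used to buy at the cost of a 1000 Hz clock thread.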
Here are some exceptions for which the default resolution is still 10 msec.

The thread scheduler's time quantum is 10 msec by default,
because preemption is driven by the clock thread (plus helper threads
for scalability). See for example dispadmin, priocntl, fx_dptbl,
rt_dptbl, and ts_dptbl. This may be changed using hires_tick.

The resolution of the clock_t data type, primarily used in DDI functions,
is 10 msec. It may be changed using hires_tick. These functions are
only used by developers writing kernel modules.

A few functions that pre-date POSIX CLOCK_REALTIME mention
_SC_CLK_TCK, CLK_TCK, "system clock", or no clock domain. These
functions are still driven by the clock thread, and their resolution
is 10 msec. They include alarm, pcsample, times, clock, and
setitimer for ITIMER_VIRTUAL and ITIMER_PROF. Their resolution may
be changed using hires_tick.

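
For the functions documented in terms of _SC_CLK_TCK, you can ask the system what tick rate it advertises. The sketch below only reports that advertised value; the helper name is hypothetical, and I am not claiming anything here about how the value responds to hires_tick.

```c
#include <unistd.h>

/* Returns the tick period in msec implied by sysconf(_SC_CLK_TCK),
 * or -1.0 if the value is unavailable. (Hypothetical helper.) */
double clk_tck_period_ms(void)
{
    long hz = sysconf(_SC_CLK_TCK);

    if (hz <= 0)
        return -1.0;

    return 1000.0 / (double)hz;
}
```

An advertised rate of 100 ticks per second gives a period of 10 msec.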
Now back to the database. How does this help the Oracle log writer?
Foreground processes post a redo record to the log writer, which
releases them after the redo has committed. When a large number of
foregrounds are waiting, the release step can slow down the log
writer, so under heavy load, the foregrounds switch to a mode where
they poll for completion. This scales better because every
foreground can poll independently, but at the cost of waiting the
minimum polling interval. That was 10 msec, but is now 1 msec in
Solaris 11.1, so the foregrounds process transactions faster under
load. Pretty cool.