Background
One of the more common issues we've been seeing in the field is the
growing difficulty in optimizing performance of multi-threaded
applications. A good portion of this difficulty is due to the increasing
complexity of modern processors that present various degrees of sharing
relationships between hardware components. Take any current CMT
processor and you'll find any number of CPUs sharing execution
pipelines, floating point units, caches, etc. Consequently, applying the
traditional recipe of one software thread for each CPU will have
varying degrees of success, according to the layout of the underlying
hardware.
On top of this increasing complexity we've also seen processors with
features that aim at dynamically resourcing software threads according
to their utilization. Intel's Turbo Boost allows processors to
increase their operating frequency if there is enough thermal headroom
available and the processor isn't fully utilized. More recently, the
SPARC T4 processor introduced dynamic threading, allowing each core to
dynamically allocate more resources to its active CPUs. Both
cases are in essence recognizing that current processors will be running a wide
mix of workloads: some designed for throughput, others for low
latency. The hardware is providing mechanisms to dynamically resource
threads according to their runtime behavior.
We're very aware of these challenges in Solaris, and have been working
to provide the best out of box performance while providing mechanisms to
further optimize applications when necessary. The Critical Threads
Optimization was introduced in Solaris 10 8/11 and Solaris 11 as one
such mechanism that allows customers to both address issues caused by
contention over shared hardware resources and explicitly take advantage
of features such as T4's dynamic threading.
What it is
The basic idea is to allow performance critical threads to execute with
more exclusive access to hardware resources. For example, when deploying
an application that implements a producer/consumer model, it'll likely
be advantageous to give the producer more exclusive access to the
hardware instead of having it competing for resources with all the
consumers. In the case of a T4 based system, we may want to have a
producer running by itself on a single core and create one consumer for
each of the remaining CPUs.
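As a rough sizing sketch of that layout, written in Python purely for illustration: the helper name and the default topology (8 cores with 8 CPUs each, as on a single SPARC T4 chip) are assumptions introduced here, not anything Solaris exposes under these names.

```python
def consumer_count(cores=8, cpus_per_core=8, critical_threads=1):
    """Number of consumers when each critical thread gets a core to itself.

    Hypothetical helper: every critical thread (the producer in the example)
    is assumed to occupy one whole core, and one consumer is created for
    each CPU on the remaining cores.
    """
    if critical_threads > cores:
        raise ValueError("more critical threads than cores")
    return (cores - critical_threads) * cpus_per_core


print(consumer_count())  # 56: one core for the producer, 7 cores x 8 CPUs for consumers
```

Sizing the consumers this way leaves the producer's core otherwise idle, which is the kind of headroom the scheduler needs in order to give a critical thread more exclusive access.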
With the Critical Threads Optimization we're extending the semantics of
scheduling priorities (which thread should run first) to include
priority over shared resources (which thread should have more "space").
Now the scheduler will not only run higher priority threads first: it
will also provide them with more exclusive access to hardware resources
if they are available.
How does it work?
Using the previous example in Solaris 11, all you'd have to do would be
to place the producer in the Fixed Priority (FX) scheduling class at
priority 60, or in the Real Time (RT) class at any priority, and Solaris will
try to give it more "hardware space". On both Solaris 10 8/11 and Solaris 11 this can be achieved through the existing priocntl(1), priocntl(2) and priocntlset(2) interfaces. If your application already assigns these priorities to performance critical threads, there's no additional step you need to take.
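A minimal sketch of how a launcher script might do this through the priocntl(1) command, here in Python. The helper names are hypothetical; the script only assembles and runs the command line, so it needs a Solaris system and sufficient privileges to take effect. For FX it sets both the user priority limit (-m) and the user priority (-p); for RT it omits -p and leaves the priority to priocntl's default, since any RT priority qualifies.

```python
import subprocess


def priocntl_argv(pid, sched_class="FX", priority=60):
    """Build a priocntl(1) command line that moves `pid` to FX or RT.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    if sched_class == "FX":
        # -m sets the user priority limit, -p the user priority itself.
        opts = ["-m", str(priority), "-p", str(priority)]
    elif sched_class == "RT":
        # Any RT priority qualifies, so no class-specific options are passed.
        opts = []
    else:
        raise ValueError("expected scheduling class FX or RT")
    return ["priocntl", "-s", "-c", sched_class, *opts, "-i", "pid", str(pid)]


def make_critical(pid, sched_class="FX", priority=60):
    """Run the command; only meaningful on Solaris with the right privileges."""
    subprocess.run(priocntl_argv(pid, sched_class, priority), check=True)
```

For example, `make_critical(producer_pid)` would run `priocntl -s -c FX -m 60 -p 60 -i pid <producer_pid>`. An application that wants to mark individual threads rather than a whole process would more likely call priocntl(2) directly from its own code.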
One important aspect of this optimization is that it requires some level of idleness in the system, either as a result of sizing the application beforehand or through periods of transient idleness during runtime. If the system is fully committed, the scheduler will put all the available CPUs to work.
Best practices
If you're an application developer, we encourage you to look into assigning the right priorities to the different threads in your application. Solaris provides different scheduling classes (Time Share, Interactive, Fair Share, Fixed Priority and Real Time) that offer different policies and behaviors. It is not always simple to figure out which threads are critical to the performance of a workload, and it may not always be feasible to take advantage of this optimization, but we believe that this can be correctly (and safely) done during development.
Overall, the out of box performance in Solaris should meet your workload's requirements. If you are looking into that extra bit of performance, then the Critical Threads Optimization may be what you're looking for.