Critical Threads Optimization
- by Rafael Vanoni
Background  
  
One of the more common issues we've been seeing in the field is the 
growing difficulty in optimizing performance of multi-threaded 
applications. A good portion of this difficulty is due to the increasing 
complexity of modern processors that present various degrees of sharing 
relationships between hardware components. Take any current CMT 
processor and you'll find any number of CPUs sharing execution 
pipelines, floating point units, caches, etc. Consequently, applying the 
traditional recipe of one software thread for each CPU will have 
varying degrees of success, according to the layout of the underlying 
hardware.
 On top of this increasing complexity we've also seen processors with 
features that aim at dynamically resourcing software threads according 
to their utilization. Intel's Turbo Boost allows processors to 
increase their operating frequency if there is enough thermal headroom 
available and the processor isn't fully utilized. More recently, the 
SPARC T4 processor introduced dynamic threading, allowing each core to 
dynamically allocate more resources to its active CPUs. Both 
cases in essence recognize that current processors will be running a wide 
mix of workloads: some designed for throughput, others for low 
latency. The hardware is providing mechanisms to dynamically resource 
threads according to their runtime behavior.
 We're very aware of these challenges in Solaris, and have been working 
to provide the best out of box performance while providing mechanisms to 
further optimize applications when necessary. The Critical Threads 
Optimization was introduced in Solaris 10 8/11 and Solaris 11 as one 
such mechanism that allows customers to both address issues caused by 
contention over shared hardware resources and explicitly take advantage 
of features such as T4's dynamic threading.
  
  What it is 
  The basic idea is to allow performance critical threads to execute with 
more exclusive access to hardware resources. For example, when deploying 
an application that implements a producer/consumer model, it'll likely 
be advantageous to give the producer more exclusive access to the 
hardware instead of having it competing for resources with all the 
consumers. In the case of a T4 based system, we may want to have a 
producer running by itself on a single core and create one consumer for 
each of the remaining CPUs.
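
To make the sizing concrete, here is a minimal sketch of that arithmetic. The function name `plan_threads` and its parameters are made up for illustration, and the numbers assume a single T4 chip with 8 cores and 8 CPUs (hardware threads) per core; adjust them to your own system.

```python
def plan_threads(cores, cpus_per_core):
    """Reserve one whole core for the producer and create one
    consumer for each CPU on the remaining cores (hypothetical helper)."""
    if cores < 2 or cpus_per_core < 1:
        raise ValueError("need at least two cores and one CPU per core")
    total_cpus = cores * cpus_per_core
    # The producer's core is left otherwise idle, so its CPUs are not
    # handed out to consumers.
    consumers = total_cpus - cpus_per_core
    return {"producers": 1, "consumers": consumers, "total_cpus": total_cpus}


# Assumed layout: one T4 chip, 8 cores x 8 CPUs per core.
plan = plan_threads(cores=8, cpus_per_core=8)
print(plan)  # {'producers': 1, 'consumers': 56, 'total_cpus': 64}
```

The point of leaving seven CPUs unused on the producer's core is exactly the "more exclusive access" described above: the producer no longer shares that core's pipelines and caches with any consumer.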
 With the Critical Threads Optimization we're extending the semantics of 
scheduling priorities (which thread should run first) to include 
priority over shared resources (which thread should have more "space"). 
Now the scheduler will not only run higher priority threads first: it 
will also provide them with more exclusive access to hardware resources 
if they are available.
  
  How does it work?  
  Using the previous example in Solaris 11, all you have to do is 
place the producer in the Fixed Priority (FX) scheduling class at 
priority 60, or in the Real Time (RT) class at any priority, and Solaris will 
try to give it more "hardware space". On both Solaris 10 8/11 and Solaris 11 this can be achieved through the existing priocntl(1,2) and priocntlset(2) interfaces. If your application already assigns these priorities to performance critical threads, there's no additional step you need to take. 
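
As a rough sketch of what this can look like from a launcher script, the snippet below builds a priocntl(1) command line that moves an existing process into FX at priority 60 (or into RT). The helper names `priocntl_argv` and `make_critical` are hypothetical, and the sketch assumes the caller has the privileges needed to raise priorities. Note that it changes a whole process; an application that wants to mark individual threads would go through priocntl(2) instead.

```python
import subprocess


def priocntl_argv(pid, sched_class="FX", priority=60):
    """Build a priocntl(1) command line for an existing process.

    Hypothetical helper: only the FX and RT classes are handled here.
    """
    if sched_class == "FX":
        # For FX, -m sets the user priority limit and -p the user priority.
        class_opts = ["-m", str(priority), "-p", str(priority)]
    elif sched_class == "RT":
        # For RT, -p sets the real-time priority.
        class_opts = ["-p", str(priority)]
    else:
        raise ValueError("unsupported scheduling class: %s" % sched_class)
    return ["priocntl", "-s", "-c", sched_class] + class_opts + ["-i", "pid", str(pid)]


def make_critical(pid, sched_class="FX", priority=60):
    """Run priocntl(1); only meaningful on a Solaris system."""
    subprocess.run(priocntl_argv(pid, sched_class, priority), check=True)


# Print the command for a producer with (made-up) process ID 1234.
print(" ".join(priocntl_argv(1234)))
```

For the made-up process ID above, this prints `priocntl -s -c FX -m 60 -p 60 -i pid 1234`, which you could also type directly at a shell prompt.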
  One important aspect of this optimization is that it requires some level of idleness in the system, either as a result of sizing the application beforehand or through periods of transient idleness during runtime. If the system is fully committed, the scheduler will put all the available CPUs to work.
  
  Best practices 
  If you're an application developer, we encourage you to look into assigning the right priorities for the different threads in your application. Solaris provides different scheduling classes (Time Share, Interactive, Fair Share, Fixed Priority and Real Time) that offer different policies and behaviors. It is not always simple to figure out which set of threads is critical to the performance of a workload, and it may not always be feasible to take advantage of this optimization, but we believe that this can be correctly (and safely) done during development. 
  Overall, the out of box performance in Solaris should meet your workload's requirements. If you are looking into that extra bit of performance, then the Critical Threads Optimization may be what you're looking for.