Faster Memory Allocation Using vmtasks
- by Steve Sistare
You may have noticed a new system process called "vmtasks" on
Solaris 11 systems:
% pgrep vmtasks
8
% prstat -p 8
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
     8 root        0K    0K sleep   99  -20   9:10:59 0.0% vmtasks/32
What is vmtasks, and why should you care? In a nutshell, vmtasks
accelerates creation, locking, and destruction of pages in shared
memory segments. This is particularly helpful for locked memory, as
creating a page of physical memory is much more expensive than
creating a page of virtual memory. For example, an ISM segment
(shmflg & SHM_SHARE_MMU) is locked in memory on the first
shmat() call, and a DISM segment (shmflg & SHM_PAGEABLE)
is locked using mlock() or memcntl().
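As a concrete illustration, here is a minimal sketch of the two attach
paths on Solaris; the segment size, error handling, and build flag are
illustrative only, not taken from any particular application:

#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/mman.h>
#include <stdio.h>

/* Build with -DUSE_DISM for the DISM path; the default is ISM. */
int main(void)
{
    size_t size = 1UL << 30;   /* 1 GB segment; size is illustrative */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id == -1) { perror("shmget"); return 1; }

#ifdef USE_DISM
    /* DISM: attach pageable, then lock explicitly with mlock() (or memcntl()) */
    void *p = shmat(id, NULL, SHM_PAGEABLE);
    if (p != (void *)-1 && mlock(p, size) != 0)
        perror("mlock");
#else
    /* ISM: physical pages are created and locked during this shmat() */
    void *p = shmat(id, NULL, SHM_SHARE_MMU);
#endif
    if (p == (void *)-1) { perror("shmat"); return 1; }

    /* ... use the segment ... */

    shmdt(p);
    shmctl(id, IPC_RMID, NULL);
    return 0;
}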
Segment operations such as creation and
locking are typically single threaded, performed by the thread making
the system call. In many applications, the size of a shared memory
segment is a large fraction of total physical memory, and the
single-threaded initialization is a scalability bottleneck which
increases application startup time.
To break the bottleneck, we apply parallel processing, harnessing the
power of the additional CPUs that are always present on modern
platforms. For sufficiently large segments, as many as 16 threads of
vmtasks are employed to assist an application thread during
creation, locking, and destruction operations. The segment is
implicitly divided at page boundaries, and each thread is given a
chunk of pages to process. The per-page processing time can vary, so
for dynamic load balancing, the number of chunks is greater than the
number of threads, and threads grab chunks dynamically as they finish
their work. Because the threads modify a single application address
space in a compressed time interval, contention on the locks protecting
VM data structures was a problem, and we had to re-scale a number of
VM locks to get good parallel efficiency. The vmtasks process has 1
thread per CPU and may accelerate multiple segment operations
simultaneously, but each operation gets at most 16 helper threads to
avoid monopolizing CPU resources. We may reconsider this limit in
the future.
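For illustration only (the names and the per-page callback are mine, not
the Solaris internals), here is a sketch of that chunking scheme: the
work is split into more chunks than threads, and each helper claims the
next unclaimed chunk until none remain:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS 16                 /* helper-thread cap described above   */
#define NCHUNKS  (NTHREADS * 4)     /* more chunks than threads, so faster
                                       threads naturally pick up more work */

static atomic_int next_chunk;       /* index of the next unclaimed chunk   */
static size_t chunk_pages;          /* pages per chunk                     */

/* Placeholder for the per-page work: create, lock, or free each page. */
static void process_pages(size_t first_page, size_t npages)
{
    (void)first_page;
    (void)npages;
}

static void *helper(void *arg)
{
    (void)arg;
    for (;;) {
        int c = atomic_fetch_add(&next_chunk, 1);   /* claim a chunk */
        if (c >= NCHUNKS)
            break;                                  /* nothing left  */
        process_pages((size_t)c * chunk_pages, chunk_pages);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    chunk_pages = (1UL << 20) / NCHUNKS;   /* e.g. split 1M pages of work */

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, helper, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}

In the real implementation the work items are page ranges within the
segment being created, locked, or destroyed; the sketch only shows the
claim-a-chunk loop that provides the dynamic load balancing.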
Acceleration using vmtasks is enabled out of the box, with no tuning
required, and works for all Solaris platform architectures (SPARC sun4u,
SPARC sun4v, x86).
The following tables show the time to create + lock + destroy a
large segment, normalized as milliseconds per gigabyte, before and
after the introduction of vmtasks:
ISM
system  ncpu  before  after  speedup
------  ----  ------  -----  -------
x4600     32    1386    245       6X
X7560     64    1016    153       7X
M9000    512    1196    206       6X
T5240    128    2506    234      11X
T4-2     128    1197    107      11X
DISM
system  ncpu  before  after  speedup
------  ----  ------  -----  -------
x4600     32    1582    265       6X
X7560     64    1116    158       7X
M9000    512    1165    152       8X
T5240    128    2796    198      14X
(I am missing the data for T4 DISM, for no good reason; it works fine).
The following table separates the creation and destruction times,
again in milliseconds per gigabyte:
ISM, T4-2
          before  after
          ------  -----
create       702     64
destroy      495     43
To put this in perspective, consider creating a 512 GB ISM segment on
T4-2. Creating the segment would take 6 minutes with the old code,
and only 33 seconds with the new. If this is your Oracle SGA, you
save over 5 minutes when starting the database, and you also save
when shutting it down prior to a restart. Those minutes go directly
to your bottom line for service availability.
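(For the curious, the arithmetic behind those numbers: 512 GB x 702 ms/GB
is about 359 seconds, or roughly 6 minutes, while 512 GB x 64 ms/GB is
about 33 seconds.)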