Best Practices - Dynamic Reconfiguration

Posted by jsavit on Oracle Blogs
Published on Sat, 1 Sep 2012 00:30:47 +0000

This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly named Logical Domains).

Overview of Dynamic Reconfiguration

Oracle VM Server for SPARC supports Dynamic Reconfiguration (DR), making it possible to add resources to or remove them from a domain (virtual machine) while it is running. This is extremely useful because resources can be shifted between virtual machines in response to load conditions without having to reboot or interrupt running applications. For example, if an application requires more CPU capacity, you can add CPUs to improve performance, and remove them when they are no longer needed. You can even use Dynamic Resource Management (DRM) policies that automatically add CPUs to and remove them from domains based on load.
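To make the idea of a load-based policy concrete, here is a toy sketch of a threshold rule of the kind such a policy might apply. It is not the actual DRM algorithm, and the function name, argument order and sample thresholds are all made up for illustration.

```shell
# Toy threshold policy: NOT the real DRM algorithm, just an illustration.
# Arguments (all hypothetical):
#   util   current CPU utilization of the domain, whole percent
#   lower  suggest removing a CPU when utilization falls below this
#   upper  suggest adding a CPU when utilization rises above this
#   cur    CPUs currently assigned
#   min    never go below this many CPUs
#   max    never go above this many CPUs
drm_decide() {
    util=$1 lower=$2 upper=$3 cur=$4 min=$5 max=$6
    if [ "$util" -gt "$upper" ] && [ "$cur" -lt "$max" ]; then
        echo add
    elif [ "$util" -lt "$lower" ] && [ "$cur" -gt "$min" ]; then
        echo remove
    else
        echo hold
    fi
}

drm_decide 90 25 75 16 2 32   # busy domain with headroom
drm_decide 10 25 75 16 2 32   # idle domain above its minimum
drm_decide 50 25 75 16 2 32   # utilization inside the band
```

The minimum and maximum bounds matter as much as the thresholds: without them, a rule like this could strip an idle domain of all its CPUs or let one busy domain absorb every spare CPU.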

How it works (in broad terms)

Dynamic Reconfiguration is done in coordination with Solaris, which recognises a hypervisor request to change its virtual machine configuration and responds appropriately. In essence, Solaris receives a message saying "you now have 16 more CPUs numbered 16 to 31" or "8GB more RAM starting at address X" or "here's a new network or disk device - have fun with it". These actions take very little time.

Solaris then can start using the new resource. In the case of added CPUs, that means dispatching processes and potentially binding interrupts to the new CPUs. For memory, Solaris adds the new memory pages to its "free" list and starts using them. Comparable actions occur with network and disk devices: they are recognised by Solaris and then used.

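Removing CPUs is the reverse process: after receiving the DR message to free specific CPUs, Solaris unbinds interrupts assigned to those CPUs and stops dispatching process threads to them. That takes very little time. In the following example, ldom1 grows from 16 to 40 virtual CPUs when it is assigned 5 cores, then shrinks back to 16 when it is set to 2 cores.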

primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       1.0%  6d 22h 29m
ldom1            active     -n----  5000    16    8G       0.9%  6h 59m
primary # ldm set-core 5 ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.2%  6d 22h 29m
ldom1            active     -n----  5000    40    8G       0.1%  6h 59m
primary # ldm set-core 2 ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       1.0%  6d 22h 29m
ldom1            active     -n----  5000    16    8G       0.9%  6h 59m
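The VCPU column in the listings above can be checked from a script. The sketch below works on a copy of the middle listing rather than calling ldm, so it runs anywhere; the helper name vcpus_of is hypothetical. Going from 5 cores to 40 VCPUs (and from 2 cores to 16) implies 8 hardware threads per core on this particular machine.

```shell
# Sample text copied from the transcript above; on a real control domain
# this would come from running `ldm list` instead.
sample='NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.2%  6d 22h 29m
ldom1            active     -n----  5000    40    8G       0.1%  6h 59m'

# vcpus_of DOMAIN: read `ldm list`-style text on stdin and print the VCPU
# column (fifth field) for the named domain. Helper name is hypothetical.
vcpus_of() {
    awk -v dom="$1" '$1 == dom { print $5 }'
}

vcpus=$(printf '%s\n' "$sample" | vcpus_of ldom1)
cores=5   # the value passed to `ldm set-core` in the transcript
echo "ldom1: $vcpus VCPUs over $cores cores = $((vcpus / cores)) threads per core"
```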

Removing memory is more involved. Memory pages are vacated by copying their contents to other memory locations and wiping them clean. Solaris may have to swap memory contents to disk if the remaining RAM isn't enough to hold all the contents. For this reason, deallocating memory can take longer on a loaded system. Even on a lightly loaded system it took 7 or 8 seconds to switch the domain below between 8GB and 24GB of RAM.

primary # ldm set-mem 24g ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.1%  6d 22h 36m
ldom1            active     -n----  5000    16    24G      0.2%  7h 6m
primary # ldm set-mem 8g ldom1
primary # ldm list
NAME             STATE      FLAGS   CONS    VCPU  MEMORY   UTIL  UPTIME
primary          active     -n-cv-  SP      16    4G       0.7%  6d 22h 37m
ldom1            active     -n----  5000    16    8G       0.3%  7h 7m
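To measure how long a resize takes on your own system, a small wrapper is enough. This is a minimal sketch: the helper name timed is hypothetical, and it assumes your date command supports the +%s format (seconds since the epoch), which not every platform provides.

```shell
# timed CMD [ARGS...]: run a command, report wall-clock seconds on stderr,
# and pass the command's exit status through. Helper name is hypothetical.
# Assumes `date +%s` is available.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s, exit status $status" >&2
    return "$status"
}

# Intended use on the control domain (commented out so the sketch runs anywhere):
#   timed ldm set-mem 24g ldom1
#   timed ldm set-mem 8g ldom1
timed sleep 1
```

Timing the shrink on a busy guest and on an idle one is a simple way to see the effect of page copying and swapping described above.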

What if the device is in use?

(this is the anecdote that inspired this blog post)

If CPU or memory is being removed, releasing it is pretty straightforward, using the method described above. The resources are released, and Solaris continues with less capacity. It's not as simple with a network or I/O device: you don't want to yank a device out from underneath an application that might be using it. In the following example, I've added a virtual network device to ldom1 and want to take it away, even though it's been plumbed.

primary # ldm rm-vnet vnet19  ldom1
Guest LDom returned the following reason for failing the operation:

                         Resource                                 Information
----------------------------------------------------------  -----------------------
/devices/virtual-devices@100/channel-devices@200/network@1  Network interface net1

VIO operation failed because device is being used in LDom ldom1
Failed to remove VNET instance 

That's what I call a helpful error message - telling me exactly what was wrong. In this case the problem is easily solved. I know this NIC is seen in the guest as net1 so:

ldom1 # ifconfig net1 down unplumb 

Now I can dispose of it, and even the virtual switch I had created for it:

primary # ldm rm-vnet vnet19  ldom1
primary # ldm rm-vsw primary-vsw9 

If I had to take the device away regardless, I could have used ldm rm-vnet -f, but that could disrupt whoever was using it. It's better if that can be avoided.
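Because the error message names the guest-side interface, that name can be extracted mechanically. The sketch below works on a copy of the error text shown earlier and assumes the message layout always matches that example; the helper name busy_interfaces is hypothetical. It only prints the command to run in the guest, since that step has to happen inside ldom1 rather than on the control domain.

```shell
# Error text copied from the failed removal above.
err='Guest LDom returned the following reason for failing the operation:

                         Resource                                 Information
----------------------------------------------------------  -----------------------
/devices/virtual-devices@100/channel-devices@200/network@1  Network interface net1

VIO operation failed because device is being used in LDom ldom1
Failed to remove VNET instance'

# busy_interfaces: read the error text on stdin and print the guest-side
# interface name from each "Network interface <name>" row.
# Helper name is hypothetical; the layout is assumed to match the post.
busy_interfaces() {
    awk '/Network interface/ { print $NF }'
}

for nic in $(printf '%s\n' "$err" | busy_interfaces); do
    # Printed rather than executed: this must be run inside the guest.
    echo "in guest: ifconfig $nic down unplumb"
done
```

Once the interface has been unplumbed in the guest, retrying the plain ldm rm-vnet should succeed without needing -f.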

Summary

Oracle VM Server for SPARC provides dynamic reconfiguration, which lets you modify a guest domain's CPU, memory and I/O configuration on the fly without a reboot. You can add and remove resources as needed, and even automate this for CPUs by setting up resource policies.

Taking things away can be more complicated than giving, especially for devices like disks and networks that may contain application and system state or be involved in a transaction. LDoms and Solaris work together to coordinate resource allocation and de-allocation in a safe and effective way. As a best practice, use dynamic reconfiguration to make the best use of your system's resources.

© Oracle Blogs or respective owner