Availability Best Practices on Oracle VM Server for SPARC
- by jsavit
This is the first of a series of blog posts on configuring Oracle VM Server for SPARC (also called Logical Domains) for availability.
This series will show how to plan for availability, improve serviceability,
avoid single points of failure, and provide resiliency against hardware and software failures.
Availability is a broad topic that has filled entire books, so these posts will focus on aspects specifically related to Oracle VM Server for SPARC.
The goal is to improve Reliability, Availability and Serviceability (RAS).
An article defining RAS can be found here.
Oracle VM Server for SPARC Principles for Availability
Let's state some guiding principles for availability that apply to Oracle VM Server for SPARC:
Avoid Single Points Of Failure (SPOFs).
Systems should be configured so a component failure does not result in a loss of application service. The general method to avoid SPOFs is to provide redundancy so service can continue without interruption if a component fails. For a critical application there may be multiple levels of redundancy so multiple failures can be tolerated. Oracle VM Server for SPARC makes it possible to configure systems
that avoid SPOFs.
Configure for availability at a level of resource and effort consistent with business needs.
Effort and resource should be consistent with business requirements. Production has stricter availability requirements than test/development, so it's worth expending more resources there to provide higher availability. Even within the category of production there may be different levels of criticality, outage tolerance, and recovery and repair time requirements.
Keep in mind that a simple design may be more understandable and effective than a complex design that attempts to "do everything".
Design for availability at the appropriate tier or level of the platform stack.
Availability can be provided in the application, in the database, or in the virtualization, hardware and network layers they depend on - or using a combination of all of them.
It may not be necessary to engineer resilient virtualization for stateless web applications where availability is provided by a network load balancer, or for enterprise applications like Oracle Real Application Clusters (RAC) and WebLogic that provide their own resiliency.
It's (often) the same architecture whether virtual or not:
For example, providing resiliency against a lost device path or failing disk media
is done for the same reasons and may use the same design whether in a domain or not.
It's (often) the same technique whether using domains or not: Many configuration steps are the same. For example, configuring IPMP or creating a redundant ZFS pool is pretty much the same
whether you're in a guest domain or not.
There are configuration steps and choices for provisioning the guest with the virtual network and disk devices, which we will discuss.
Sometimes it is different using domains: There are new resources to configure.
Most notable is the use of alternate service domains, which provide resiliency in case of a domain failure, and also
permit improved serviceability via "rolling upgrades". This is an important differentiator
between Oracle VM Server for SPARC and traditional virtual machine environments where all virtual I/O
is provided by a monolithic infrastructure that itself is a SPOF.
Alternate service domains are widely used to provide resiliency in production logical domains environments.
Some things are done via logical domains commands, and some are done in the guest:
For example, with Oracle VM Server for SPARC we provide multiple network connections to the guest, and then configure network resiliency in the guest via IP Multipathing (IPMP),
essentially the same as for non-virtual systems.
On the other hand, we configure virtual disk availability in the virtualization layer, and the guest sees an already-resilient disk without being aware of the details. These blogs will discuss configuration details like this.
Live migration is not "high availability" in the sense of "continuous availability": If the server is down, then you don't live migrate from it! (A cluster or VM restart elsewhere would be used.) However, live migration can be part of the RAS picture by improving Serviceability: you can move running domains off of a box before planned service or maintenance. The blog Best Practices - Live Migration on Oracle VM Server for SPARC discusses this.
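To put a rough number on the first principle (avoid single points of failure), here is a small back-of-the-envelope sketch in Python. It assumes that component failures are independent, which real systems only approximate, and the 99% figure is purely illustrative rather than a measurement of any product. The function names are made up for this example.

```python
def parallel_availability(component_availability: float, copies: int) -> float:
    """Availability of `copies` redundant components where any one suffices.

    Assumes failures are independent, which is an idealization.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one component")
    # The service is lost only if every copy is down at the same time.
    return 1.0 - (1.0 - component_availability) ** copies


def series_availability(*availabilities: float) -> float:
    """Availability of a chain in which every element is required."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    single = 0.99  # illustrative value only
    for copies in (1, 2, 3):
        print(copies, round(parallel_availability(single, copies), 6))
```

Under these assumptions, two redundant copies of a 99% available component give roughly 99.99%, while chaining two such components in series (both required) gives about 98%. That is why removing a SPOF tends to matter more than polishing any single component, and also why each extra level of redundancy should be weighed against business needs.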
Topics
Here are some of the topics that will be covered:
Network availability using IP Multipathing and aggregates
Disk path availability using virtual disks defined with multipath groups ("mpgroup")
Disk media resiliency using redundant disks that can tolerate media loss
Multiple service domains - this is probably the most significant item and the one most specific to Oracle VM Server for SPARC. It is very widely deployed in production environments as the means to provide network and disk availability, but it can be confusing. Subsequent articles will describe why and how to configure multiple service domains.
Note, for the sake of precision: an I/O domain is any domain that has a physical I/O resource (such as a PCIe bus root complex). A service domain is a domain providing virtual device services to other domains; it is almost always an I/O domain too (so it can have something to serve).
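The idea behind multiple service domains can be sketched as a simple planning check: every resilient resource a guest sees (an IPMP group, a multipath virtual disk) should have paths through more than one service domain. The Python below is only a model for reasoning about a design. It does not query a real system, and the class, the function, and the names "alternate", "net-ipmp0" and "disk0" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResilientGroup:
    """A guest-visible resource with one or more paths.

    Examples: an IPMP group of virtual network devices, or a virtual disk
    with several paths. Each list entry names the service domain that
    serves one path.
    """
    name: str
    serving_domains: List[str] = field(default_factory=list)


def find_spofs(groups: List[ResilientGroup]) -> List[str]:
    """Return the names of groups that depend on fewer than two distinct service domains."""
    return [g.name for g in groups if len(set(g.serving_domains)) < 2]


if __name__ == "__main__":
    guest = [
        # Network paths come from two different service domains.
        ResilientGroup("net-ipmp0", ["primary", "alternate"]),
        # Two disk paths, but both served by the same domain.
        ResilientGroup("disk0", ["primary", "primary"]),
    ]
    print(find_spofs(guest))
```

In this made-up layout the disk is flagged: it has two paths, but both go through the same service domain, so losing that domain would still take the disk away from the guest.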
Resources
Here are some important links; we'll be drawing on their content in the next several articles:
Oracle VM Server for SPARC Documentation
Maximizing Application Reliability and Availability with SPARC T5 Servers whitepaper by Gary Combs
Maximizing Application Reliability and Availability with the SPARC M5-32 Server whitepaper by Gary Combs
Summary
Oracle VM Server for SPARC offers features that can be used to provide highly available environments. This and the following blog entries will describe how to plan and deploy them.