Impact of Server Failure on Coherence Request Processing
- by jpurdy
Requests against a given cache server may be temporarily blocked for several seconds following the failure of another cluster member. This can be a problem for applications that cannot tolerate multi-second response times even during failover (setting aside, for the moment, the variety of practical issues that make such absolute guarantees challenging even when no servers fail).
In general, Coherence is designed around the principle that a failure in one member should not affect the rest of the cluster if at all possible. However, if the failed member was managing a piece of state that another member depends on, that second member must wait until a new member assumes responsibility for managing that state. As of Coherence 3.7, this transfer of responsibility is performed by the primary service thread of each cache service, and the finest possible granularity for transferring responsibility is a single partition. The question, then, becomes how to minimize the time spent processing each partition.
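Since the partition is the unit of transfer, it is worth knowing how a given service is partitioned before tuning anything. The following minimal sketch uses the standard PartitionedService interface to print the values that matter here; the cache name ("example-cache") is a hypothetical placeholder for your own cache:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.net.PartitionedService;

public class PartitionInfo {
    public static void main(String[] args) {
        // "example-cache" is a hypothetical cache name; substitute your own
        NamedCache cache = CacheFactory.getCache("example-cache");

        // A distributed (partitioned) cache service implements PartitionedService
        PartitionedService service = (PartitionedService) cache.getCacheService();

        // Total partitions in the service; failover transfers them one at a time
        System.out.println("Partition count: " + service.getPartitionCount());
        // Number of backup copies maintained for each partition
        System.out.println("Backup count:    " + service.getBackupCount());
        // Members currently eligible to own partitions
        System.out.println("Storage members: "
                + service.getOwnershipEnabledMembers().size());
    }
}
```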
Here are some optimizations that may reduce this period:
- Reduce the size of each partition (by increasing the partition count)
- Increase the number of JVMs across the cluster (increasing the total number of primary service threads)
- Increase the number of CPUs across the cluster (making sure that each JVM has a CPU core available when needed)
- Re-evaluate the set of configured indexes, as these must be rebuilt when a partition moves (see the sketch after this list)
- Make sure that the backing map is as fast as possible (in most cases this means running on-heap)
- Make sure that the cluster is running on hardware with fast CPU cores (since partition processing is single-threaded)
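On the index point above: each configured index must be rebuilt on the new owner when a partition transfers, so timing the initial build gives a rough (not definitive) proxy for that per-failover cost. In this sketch the cache name and the extracted methods (getLastName, getZipCode) are hypothetical. The partition count itself, by contrast, is set declaratively in the cache configuration file (the partition-count element of the distributed scheme), not through this API:

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;
import com.tangosol.util.extractor.ReflectionExtractor;

public class IndexCost {
    public static void main(String[] args) {
        // Hypothetical cache name; substitute your own
        NamedCache cache = CacheFactory.getCache("example-cache");

        // Every index configured here must be rebuilt on the new owner
        // whenever a partition is transferred, so each addIndex call adds
        // to the time spent processing each partition during failover.
        long start = System.nanoTime();
        cache.addIndex(new ReflectionExtractor("getLastName"),
                true /*ordered*/, null /*natural comparator*/);
        cache.addIndex(new ReflectionExtractor("getZipCode"), false, null);
        long elapsedMs = (System.nanoTime() - start) / 1000000L;

        // Initial build time is only a rough proxy for the rebuild cost
        // incurred as partitions arrive during failover
        System.out.println("Index build took " + elapsedMs + " ms");
    }
}
```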
As always, proper testing is required to make sure that configuration changes have the desired effect (and also to quantify that effect).
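One simple way to quantify the effect is to run a latency probe from a client while forcibly killing a storage-enabled JVM: the worst-case get() latency observed approximates the failover stall described above. A sketch, again assuming the hypothetical "example-cache":

```java
import com.tangosol.net.CacheFactory;
import com.tangosol.net.NamedCache;

public class FailoverLatencyProbe {
    public static void main(String[] args) throws InterruptedException {
        NamedCache cache = CacheFactory.getCache("example-cache");
        cache.put("probe-key", "probe-value");

        long maxMs = 0;
        // Run this loop from a client while killing a storage-enabled JVM;
        // the longest observed get() roughly measures the failover stall.
        for (int i = 0; i < 600; i++) {
            long start = System.nanoTime();
            cache.get("probe-key");
            long ms = (System.nanoTime() - start) / 1000000L;
            if (ms > maxMs) {
                maxMs = ms;
                System.out.println("New worst-case get latency: " + ms + " ms");
            }
            Thread.sleep(100); // ~10 probes per second for about a minute
        }
        System.out.println("Max get latency observed: " + maxMs + " ms");
    }
}
```

Comparing the worst-case latency before and after a configuration change (for example, a higher partition count) is a straightforward way to confirm whether the change actually helped.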