Oracle Coherence, Split-Brain and Recovery Protocols In Detail
- by Ricardo Ferreira
This article provides a high level conceptual overview of Split-Brain scenarios in distributed systems. It will focus on a specific example of cluster communication failure and recovery in Oracle Coherence. This includes a discussion on the witness protocol (used to remove failed cluster members) and the panic protocol (used to resolve Split-Brain scenarios).
Note that the removal of cluster members does not necessarily indicate a Split-Brain condition. Oracle Coherence does not (and cannot) detect a Split-Brain as it occurs, the condition is only detected when cluster members that previously lost contact with each other regain contact.
Cluster Topology and Configuration
In order to create an good didactic for the article, let's assume a cluster topology and configuration. In this example we have a six member cluster, consisting of one JVM on each physical machine. The member IDs are as follows:
Member ID
IP Address
1
10.149.155.76
2
10.149.155.77
3
10.149.155.236
4
10.149.155.75
5
10.149.155.79
6
10.149.155.78
Members 1, 2, and 3 are connected to a switch, and members 4, 5, and 6 are connected to a second switch. There is a link between the two switches, which provides network connectivity between all of the machines.
Member 1 is the first member to join this cluster, thus making it the senior member. Member 6 is the last member to join this cluster. Here is a log snippet from Member 6 showing the complete member set:
2010-02-26 15:27:57.390/3.062 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=main, member=6): Started DefaultCacheServer...
SafeCluster: Name=cluster:0xDDEB
Group{Address=224.3.5.3, Port=35465, TTL=4}
MasterMemberSet
(
ThisMember=Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
OldestMember=Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
ActualMemberSet=MemberSet(Size=6, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer)
Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
)
RecycleMillis=120000
RecycleSet=MemberSet(Size=0, BitSetCount=0
)
)
At approximately 15:30, the connection between the two switches is severed:
Thirty seconds later (the default packet timeout in development mode) the logs indicate communication failures across the cluster. In this example, the communication failure was caused by a network failure. In a production setting, this type of communication failure can have many root causes, including (but not limited to) network failures, excessive GC, high CPU utilization, swapping/virtual memory, and exceeding maximum network bandwidth. In addition, this type of failure is not necessarily indicative of a split brain. Any communication failure will be logged in this fashion. Member 2 logs a communication failure with Member 5:
2010-02-26 15:30:32.638/196.928 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=5, Timestamp=2010-02-26 15:27:49.095, Address=10.149.155.79:8088, MachineId=1103, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:3229, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
)
The Coherence clustering protocol (TCMP) is a reliable transport mechanism built on UDP. In order for the protocol to be reliable, it requires an acknowledgement (ACK) for each packet delivered. If a packet fails to be acknowledged within the configured timeout period, the Coherence cluster member will log a packet timeout (as seen in the log message above). When this occurs, the cluster member will consult with other members to determine who is at fault for the communication failure. If the witness members agree that the suspect member is at fault, the suspect is removed from the cluster. If the witnesses unanimously disagree, the accuser is removed. This process is known as the witness protocol. Since Member 2 cannot communicate with Member 5, it selects two witnesses (Members 1 and 4) to determine if the communication issue is with Member 5 or with itself (Member 2). However, Member 4 is on the switch that is no longer accessible by Members 1, 2 and 3; thus a packet timeout for member 4 is recorded as well:
2010-02-26 15:30:35.648/199.938 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=2): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
)
Member 1 has the ability to confirm the departure of member 4, however Member 6 cannot as it is also inaccessible. At the same time, Member 3 sends a request to remove Member 6, which is followed by a report from Member 3 indicating that Member 6 has departed the cluster:
2010-02-26 15:30:35.706/199.996 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=2): MemberLeft request for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
2010-02-26 15:30:35.709/199.999 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=2): MemberLeft notification for Member 6 received from Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
The log for Member 3 determines how Member 6 departed the cluster:
2010-02-26 15:30:35.161/191.694 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=3): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
)
2010-02-26 15:30:35.165/191.698 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=3): Member departure confirmed by MemberSet(Size=2, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
Member(Id=2, Timestamp=2010-02-26 15:27:17.847, Address=10.149.155.77:8088, MachineId=1101, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:296, Role=CoherenceServer)
); removing Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
In this case, Member 3 happened to select two witnesses that it still had connectivity with (Members 1 and 2) thus resulting in a simple decision to remove Member 6.
Given the departure of Member 6, Member 2 is left with a single witness to confirm the departure of Member 4:
2010-02-26 15:30:35.713/200.003 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=2): Member departure confirmed by MemberSet(Size=1, BitSetCount=2
Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer)
); removing Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
In the meantime, Member 4 logs a missing heartbeat from the senior member. This message is also logged on Members 5 and 6.
2010-02-26 15:30:07.906/150.453 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=PacketListenerN, member=4): Scheduled senior member heartbeat is overdue; rejoining multicast group.
Next, Member 4 logs a TcpRing failure with Member 2, thus resulting in the termination of Member 2:
2010-02-26 15:30:21.421/163.968 Oracle Coherence GE 3.5.3/465p2 <D4> (thread=Cluster, member=4): TcpRing: Number of socket exceptions exceeded maximum; last was "java.net.SocketTimeoutException: connect timed out"; removing the member: 2
For quick process termination detection, Oracle Coherence utilizes a feature called TcpRing which is a sparse collection of TCP/IP-based connections between different members in the cluster. Each member in the cluster is connected to at least one other member, which (if at all possible) is running on a different physical box. This connection is not used for any data transfer, only heartbeat communications are sent once a second per each link. If a certain number of exceptions are thrown while trying to re-establish a connection, the member throwing the exceptions is removed from the cluster. Member 5 logs a packet timeout with Member 3 and cites witnesses Members 4 and 6:
2010-02-26 15:30:29.791/165.037 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=PacketPublisher, member=5): Timeout while delivering a packet; requesting the departure confirmation for Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
by MemberSet(Size=2, BitSetCount=2
Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
)
2010-02-26 15:30:29.798/165.044 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=5): Member departure confirmed by MemberSet(Size=2, BitSetCount=2
Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
Member(Id=6, Timestamp=2010-02-26 15:27:58.635, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer)
); removing Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer)
Eventually we are left with two distinct clusters consisting of Members 1, 2, 3 and Members 4, 5, 6, respectively. In the latter cluster, Member 4 is promoted to senior member.
The connection between the two switches is restored at 15:33. Upon the restoration of the connection, the cluster members immediately receive cluster heartbeats from the two senior members. In the case of Members 1, 2, and 3, the following is logged:
2010-02-26 15:33:14.970/369.066 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=1): The member formerly known as Member(Id=4, Timestamp=2010-02-26 15:30:35.341, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.
Likewise for Members 4, 5, and 6:
2010-02-26 15:33:14.343/336.890 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=4): The member formerly known as Member(Id=1, Timestamp=2010-02-26 15:30:31.64, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) has been forcefully evicted from the cluster, but continues to emit a cluster heartbeat; henceforth, the member will be shunned and its messages will be ignored.
This message indicates that a senior heartbeat is being received from members that were previously removed from the cluster, in other words, something that should not be possible. For this reason, the recipients of these messages will initially ignore them. After several iterations of these messages, the existence of multiple clusters is acknowledged, thus triggering the panic protocol to reconcile this situation. When the presence of more than one cluster (i.e. Split-Brain) is detected by a Coherence member, the panic protocol is invoked in order to resolve the conflicting clusters and consolidate into a single cluster. The protocol consists of the removal of smaller clusters until there is one cluster remaining. In the case of equal size clusters, the one with the older Senior Member will survive. Member 1, being the oldest member, initiates the protocol:
2010-02-26 15:33:45.970/400.066 Oracle Coherence GE 3.5.3/465p2 <Warning> (thread=Cluster, member=1): An existence of a cluster island with senior Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer) containing 3 nodes have been detected. Since this Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) is the senior of an older cluster island, the panic protocol is being activated to stop the other island's senior and all junior nodes that belong to it.
Member 3 receives the panic:
2010-02-26 15:33:45.803/382.336 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=3): Received panic from senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer) caused by Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer)
Member 4, the senior member of the younger cluster, receives the kill message from Member 3:
2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service.
In turn, Member 4 requests the departure of its junior members 5 and 6:
2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service.
2010-02-26 15:33:43.343/349.015 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=6): Received a Kill message from a valid Member(Id=4, Timestamp=2010-02-26 15:27:39.574, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer); stopping cluster service.
Once Members 4, 5, and 6 restart, they rejoin the original cluster with senior member 1. The log below is from Member 4. Note that it receives a different member id when it rejoins the cluster.
2010-02-26 15:33:44.921/367.468 Oracle Coherence GE 3.5.3/465p2 <Error> (thread=Cluster, member=4): Received a Kill message from a valid Member(Id=3, Timestamp=2010-02-26 15:27:24.892, Address=10.149.155.236:8088, MachineId=1260, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:32459, Role=CoherenceServer); stopping cluster service.
2010-02-26 15:33:46.921/369.468 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Service Cluster left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:InvocationService, member=4): Service InvocationService left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=OptimisticCache, member=4): Service OptimisticCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=ReplicatedCache, member=4): Service ReplicatedCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=DistributedCache, member=4): Service DistributedCache left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Invocation:Management, member=4): Service Management left the cluster
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service Management with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service DistributedCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service ReplicatedCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service OptimisticCache with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member 6 left service InvocationService with senior member 5
2010-02-26 15:33:47.046/369.593 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=4): Member(Id=6, Timestamp=2010-02-26 15:33:47.046, Address=10.149.155.78:8088, MachineId=1102, Location=process:228, Role=CoherenceServer) left Cluster with senior member 4
2010-02-26 15:33:49.218/371.765 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=main, member=n/a): Restarting cluster
2010-02-26 15:33:49.421/371.968 Oracle Coherence GE 3.5.3/465p2 <D5> (thread=Cluster, member=n/a): Service Cluster joined the cluster with senior service member n/a
2010-02-26 15:33:49.625/372.172 Oracle Coherence GE 3.5.3/465p2 <Info> (thread=Cluster, member=n/a): This Member(Id=5, Timestamp=2010-02-26 15:33:50.499, Address=10.149.155.75:8088, MachineId=1099, Location=process:800, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=1) joined cluster "cluster:0xDDEB" with senior Member(Id=1, Timestamp=2010-02-26 15:27:06.931, Address=10.149.155.76:8088, MachineId=1100, Location=site:usdhcp.oraclecorp.com,machine:dhcp-burlington6-4fl-east-10-149,process:511, Role=CoherenceServer, Edition=Grid Edition, Mode=Development, CpuCount=2, SocketCount=2)
Cool isn't it?