My production cluster has had the repair service enabled since April 16th, with the default 9-day time to completion, and repairs used to complete properly. However, since May 22nd, the service has been disabled automatically by OpsCenter:
From /var/log/opscenter/opscenterd.log:
[...]
2014-06-03 21:13:47-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4019838962446882275L, -4006140687792135587L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: More than 100 errors during repair service, shutting down repair service
2014-06-03 22:16:44-0400 [zs_prod] INFO: Stopping repair service
[...]
From /var/log/opscenter/repair_service/zs_prod.log:
[...]
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Repair task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) timed out after 3600 seconds.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: Task (<Node 10.1.0.22='6417880425364517165'>, (-4006140687792135587L, -4006140687792135586L), set(['zs_logging', 'OpsCenter'])) has failed 1 times.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: 101 errors have ocurred out of 100 allowed.
2014-06-03 22:16:44-0400 [zs_prod] ERROR: More than 100 errors during repair service, shutting down repair service
2014-06-03 22:16:44-0400 [zs_prod] INFO: Stopping repair service
On the nodes on which the repair fails, from /var/log/cassandra/system.log:
ERROR [RMI TCP Connection(93502)-10.1.0.22] 2014-06-03 20:12:28,858 StorageService.java (line 2560) Repair session failed:
java.lang.IllegalArgumentException: Requested range intersects a local range but is not fully contained in one; this would lead to imprecise repair
at org.apache.cassandra.service.ActiveRepairService.getNeighbors(ActiveRepairService.java:164)
at org.apache.cassandra.repair.RepairSession.<init>(RepairSession.java:128)
at org.apache.cassandra.repair.RepairSession.<init>(RepairSession.java:117)
at org.apache.cassandra.service.ActiveRepairService.submitRepairSession(ActiveRepairService.java:97)
at org.apache.cassandra.service.StorageService.forceKeyspaceRepair(StorageService.java:2620)
at org.apache.cassandra.service.StorageService$5.runMayThrow(StorageService.java:2556)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
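For what it's worth, my reading of the exception is that the repair coordinator rejects any requested token range that overlaps a local range without being fully contained in a single one. A minimal sketch of that containment check (simplified, ignoring ring wraparound; the helper names are mine, not Cassandra's — the real logic is in `ActiveRepairService.getNeighbors`):

```python
# Token ranges modeled as (start, end] pairs of Murmur3 tokens.
# This is an illustrative sketch, not the actual Cassandra API.

def contains(local, requested):
    """True if `requested` lies entirely within `local` (no wraparound)."""
    l_start, l_end = local
    r_start, r_end = requested
    return l_start <= r_start and r_end <= l_end

def intersects(a, b):
    """True if the two ranges overlap (no wraparound)."""
    a_start, a_end = a
    b_start, b_end = b
    return a_start < b_end and b_start < a_end

def check_repair_range(local_ranges, requested):
    """Mimic the guard that seems to throw in my logs: a requested
    range must fit entirely inside one local range, otherwise a
    partial overlap is rejected as an imprecise repair."""
    if any(contains(local, requested) for local in local_ranges):
        return True
    if any(intersects(local, requested) for local in local_ranges):
        raise ValueError(
            "Requested range intersects a local range but is not "
            "fully contained in one; this would lead to imprecise repair")
    return False
```

So the one-token-wide range from my logs, `(-4006140687792135587, -4006140687792135586)`, should only fail this check if it straddles a local range boundary, which is what I don't understand about my cluster's state.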
These errors, which occur only while the repair service is running, are the only errors these nodes experience. Outside of the repair task, the Cassandra cluster works perfectly.
I am running OpsCenter 4.1.2 with a 6-node DSE 4.0.2 cluster installed on Linux virtual machines. The nodes run a vanilla installation of Ubuntu Server 12.04 64-bit, and DSE was installed and secured according to the provided installation documentation.
I had been experiencing this problem on my development cluster for a while too (with DSE 4.0.0, 4.0.1 and 4.0.2), but I assumed it was caused by some configuration error on my part. There as well, the problem appeared spontaneously at some point.
The Cassandra cluster has been working very smoothly with good write throughput. It is very stable and has enough resources to work with. We have not noticed any problems with the applications that depend on it.