Testing fault tolerant code

Posted by Robert on Stack Overflow
Published on 2010-05-03T09:09:27Z

I’m currently working on a server application where we have agreed to try to maintain a certain level of service. The level of service we want to guarantee is this: if a request is accepted by the server and the server sends an acknowledgement to the client, the request will be carried out, even if the server crashes. As requests can be long running and the acknowledgement time needs to be short, we implement this by persisting the request, then sending an acknowledgement to the client, then carrying out the various actions to fulfill the request. As actions are carried out they too are persisted, so the server knows the state of each request on start-up, and there are also various reconciliation mechanisms with external systems to check the accuracy of our logs.
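
A minimal Python sketch of the persist, acknowledge, execute ordering described above, not the actual server code. `RequestLog`, `handle_request`, the JSON-lines journal file and the record types (`accepted`, `action_done`, `finished`) are all assumptions made for illustration; the real persistence layer is not described in this post.

```python
import json
import os


class RequestLog:
    """Hypothetical append-only journal, one JSON record per line."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # ask the OS to push the record to disk

    def replay(self):
        """Return {request_id: [completed action names]} for unfinished requests."""
        pending = {}
        if not os.path.exists(self.path):
            return pending
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    rec = json.loads(line)
                except json.JSONDecodeError:
                    break  # torn final record left by a crash mid-write
                if rec["type"] == "accepted":
                    pending.setdefault(rec["id"], [])
                elif rec["type"] == "action_done":
                    pending.setdefault(rec["id"], []).append(rec["action"])
                elif rec["type"] == "finished":
                    pending.pop(rec["id"], None)
        return pending


def handle_request(log, request_id, actions, send_ack):
    """actions is a list of (name, callable) pairs; send_ack notifies the client."""
    log.append({"type": "accepted", "id": request_id})  # 1. persist the request
    send_ack(request_id)                                 # 2. acknowledge the client
    for name, action in actions:                         # 3. carry out the actions
        action()
        log.append({"type": "action_done", "id": request_id, "action": name})
    log.append({"type": "finished", "id": request_id})
```

On start-up, `replay()` would tell the server which acknowledged requests still need work and which of their actions were already recorded. Note that in this sketch a crash between `action()` and the following `append` leaves an action done but unrecorded, which is exactly the kind of window the tests need to hit.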

This all seems to work fairly well, but we have difficulty saying so with any conviction, as we find it very hard to test our fault tolerant code. So far we’ve come up with two strategies, but neither is entirely satisfactory:

  • Have an external process watch the server and kill it at what the external process thinks is an appropriate point in the test
  • Add code to the application that will cause it to crash at certain known critical points
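
The first strategy could look roughly like the Python sketch below. It assumes the server writes progress lines to stdout that the watcher can key on; the function name `run_and_kill_at` and the idea of a `marker` string are hypothetical, not something our server necessarily provides.

```python
import subprocess


def run_and_kill_at(cmd, marker):
    """Start cmd and kill it once a stdout line contains marker.

    Returns (lines seen up to the kill, process return code).
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    seen = []
    for line in proc.stdout:
        seen.append(line.rstrip("\n"))
        if marker in line:
            proc.kill()  # SIGKILL on POSIX, TerminateProcess on Windows
            break
    proc.stdout.close()
    proc.wait()
    return seen, proc.returncode
```

The kill is only issued after the watcher has read the marker line, so the server may already have moved on by an unknown amount when it dies.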

My problem with the first strategy is that the external process cannot know the exact state of the application, so we cannot be sure we’re hitting the most problematic points in the code. My problem with the second strategy, although it gives more control over where the fault takes place, is that I do not like having fault injection code within my application, even with optional compilation etc. I fear it would be too easy to overlook a fault injection point and have it slip into a production environment.
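
For reference, an in-process fault injection point of the kind the second strategy needs might be sketched in Python as follows. The `crash_point` helper and the `CRASH_AT` environment variable are made-up names for illustration; `env` and `exit_fn` are parameters only so the helper can be exercised without killing the test runner.

```python
import os


def crash_point(name, env=os.environ, exit_fn=os._exit):
    """Die abruptly if this named point has been selected for the test run.

    os._exit skips cleanup handlers and buffered-output flushing,
    which is closer to a hard crash than raising an exception.
    """
    if env.get("CRASH_AT") == name:
        exit_fn(70)  # arbitrary non-zero status chosen for this sketch


# Hypothetical call sites inside the request handler:
#   persist(request); crash_point("after_persist")
#   send_ack(request); crash_point("after_ack")
```

Each call is inert unless `CRASH_AT` names it, but the call sites themselves still ship with the application, which is the part that worries me.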

© Stack Overflow or respective owner
