Situation: a piece of software reads frames of data from a file in a separate thread and puts them on a queue, which is emptied by another thread. That second thread periodically checks the queue and fails rather gracefully, by showing an error message stating that the read timed out, if no data becomes available within a certain amount of time. Initially this timeout was set to 200 ms. There was no real reasoning behind that constant, but it worked fine. We measured on a couple of machines, and for large data frames, larger than what customers would use, a read took about 20 ms with no other load on the machine.
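To make the setup concrete, here is a minimal sketch of it in Python. The real code is not shown here, so all names (`reader`, `next_frame`, `READ_TIMEOUT_S`) are made up for illustration, and the file read is replaced by an in-memory list of frames; only the 200 ms value comes from the description above.

```python
import queue
import threading

READ_TIMEOUT_S = 0.2  # the 200 ms constant from the description


def reader(frames, out_queue):
    """Producer: stands in for the thread that reads frames from a file."""
    for frame in frames:
        out_queue.put(frame)


def next_frame(in_queue, timeout_s=READ_TIMEOUT_S):
    """Consumer side: return the next frame, or None if the read timed out."""
    try:
        return in_queue.get(timeout=timeout_s)
    except queue.Empty:
        # This is where the application would show its "read timed out" error.
        return None


def demo():
    """Read two frames successfully, then hit the timeout on an empty queue."""
    q = queue.Queue()
    t = threading.Thread(target=reader, args=([b"frame-0", b"frame-1"], q))
    t.start()
    t.join()
    got = [next_frame(q), next_frame(q)]
    timed_out = next_frame(q, timeout_s=0.05) is None
    return got, timed_out
```

The whole question is about what value `READ_TIMEOUT_S` should have and how to justify it.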
However, one customer now gets timeout errors every now and then (on the second try all is fine, probably because the file is in the cache by then or the virus scanner leaves it alone). The programmers are like 'well, yeah, but that customer's machine is full of cruft, virus scanners, tons of unneeded background processes etc.' Of course the customer is like 'hey, this should just work, shouldn't it?' While the programmers have a point, since the software is heavy enough to justify a dedicated machine, that does not make the customer happy.
Increasing the timeout to 2 seconds, for example, solves the problem. But I'd like to make a proper decision now instead of just randomly picking some magic constant that is probably OK in 99% of cases. What criteria should be used for that? We could just pick a large number, but that feels wrong (and then we end up with a program that has the horrible behaviour of hanging when trying to read from a disconnected drive, for instance, whereas we'd rather have it show an error right away). Or we could make the timeout value a user setting, but then we need to document it clearly, and even then not all customers are tech-savvy enough to really understand what it does. Or we could wait until another customer reports timeouts and increase the value again. And again. Until we find something that is OK for 99.99% of cases. Is there any good practice for this type of situation?
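For the user-setting option, this is roughly what I have in mind, as a sketch only. The environment variable name `FRAME_READ_TIMEOUT_S` and the 10 second upper bound are hypothetical; the 2 second default and the 200 ms lower bound are just the two values mentioned above, not values I can justify yet.

```python
import math
import os

DEFAULT_TIMEOUT_S = 2.0   # the value that fixed the customer's problem
MIN_TIMEOUT_S = 0.2       # the original 200 ms constant
MAX_TIMEOUT_S = 10.0      # hypothetical cap so a bad setting cannot hang the UI


def read_timeout_s(env=None):
    """Return the read timeout in seconds from a (hypothetical) user setting."""
    env = os.environ if env is None else env
    raw = env.get("FRAME_READ_TIMEOUT_S")  # hypothetical setting name
    if raw is None:
        return DEFAULT_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        # Unparseable setting: fall back instead of failing at startup.
        return DEFAULT_TIMEOUT_S
    if not math.isfinite(value):
        # Reject "nan" and "inf", which float() accepts.
        return DEFAULT_TIMEOUT_S
    # Clamp so that neither a tiny nor a huge value can be configured.
    return min(max(value, MIN_TIMEOUT_S), MAX_TIMEOUT_S)
```

Clamping keeps a mistyped setting from bringing back either problem (spurious timeouts or long hangs), but it does not answer what the default and the bounds should be.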