When software problems reported are not really software problems
- by AndyUK
Hi
Apologies if this has already been covered or you think it really belongs on wiki.
I am a software developer at a company that manufactures microarray printing machines for the biosciences industry. I am primarily involved in interfacing with various bits of hardware (pneumatics, hydraulics, stepper motors, sensors etc) via GUI development in C++ to aspirate and print samples onto microarray slides.
On joining the company I noticed that whenever there was a hardware-related problem this would cause the whole setup to freeze, with nobody being any the wiser as to what the specific problem was - hardware / software / misuse etc. Since then I have improved things somewhat by introducing software timeouts and exception handling to better identify and deal with any hardware-related problems that arise eg PLC commands not successfully completed, inappropriate FPGA response commands, and various other deadlock type conditions etc. In addition, the software will now log a summary of the specific problem, inform the user and exit the thread gracefully. This software is not embedded, just interfacing using serial ports.
In spite of what has been achieved, non-software guys still do not fully appreciate that in these cases, the 'software' problem they are reporting to me is not really a software problem, rather the software is reporting a problem, but not causing it. Don't get me wrong, there is nothing I enjoy more than to come down on software bugs like a ton of bricks, and looking at ways of improving robustness in any way. I know the system well enough now that I almost have a sixth sense for these things.
No matter how many times I try to explain this point to people, it does not really penetrate. They still report what are essentially hardware problems (which eventually get fixed) as software ones.
I would like to hear from any others that have endured similar finger-pointing experiences and what methods they used to deal with them.