I have a dozen or so single board computers on a network running Debian (squeeze) and access them via ssh (ssh server is dropbear). To give an idea of the hardware of these computers they're 1.2 GHz x86 processors, 1GB of RAM and 4GB flash drives formatted as ext2 (I avoided ext3 to prevent the added flash write stress from journaling), there is also a swap partition on the drive.
Normally the setup I'm using works great and I can access all the computers. Every once in a while one will prevent access. What happens is I try to connect via ssh (putty) and it gives me the login prompt, I enter the username and password and it responds 'Access Denied' and it will also refuse any public key in ~/.ssh/authorized_keys. The credentials are correct as they worked previously. The computer responds to pings and putty recognizes the server public key, which implies to me the system is still running. Restarting the server fixes the problem and I can log in again. (I tried a temporary fix of putting shutdown -r now in the root crontab but this doesn't seem to reliably be run once the hang happens) Once I restart however there doesn't seem to be any information in any of the system logs to indicate what happened, the logs are simply empty for that time period, as if the system had crashed.
There is some custom software running on the system which appears to stop working (which is why I wanted to ssh to begin with). I'm assuming that this program is the source of the problems but I'm unsure of how it would cause it and how to debug what is happening.
The most likely explanation I can think of is that there is a memory leak in the other program that then prevents dropbear from spawning a new login shell (and crontab from executing shutdown) as there is not enough free memory. But looking at memory usage of the other (working) computers there doesn't seem to be any meaningful increase in memory to indicate a leak (unless it's a very big, fast acting and rare leak). I would think that when the OS ran out of memory it would restart the system or kill processes (the Linux kernel restarts right?). The other thing I wonder about is if the fact that they are running off a flash drive could have some effect, especially the swap partition (which I think I should remove to prevent wear of the flash), but the flash drives are young (~1 month) and I don't think that wear would be a factor yet.
Does anybody have an idea of what could cause these symptoms, if it could be done by a memory leak, or something else I haven't thought of. And does anybody know of a method to try to debug the problem and find out more information about what's going wrong?