What part of SMF is likely broken by a hard power down?
- by David Mackintosh
At one of my customer sites, the local guy shut down their local Solaris 10 x86 server, pulled the power inputs, moved it, and now it won’t start properly. It boots and then presents a prompt which lets you log in. This appears to be single user milestone (or equivalent).
Digging into it, I think that SMF isn’t permitting the system to go multi-user. SMF was generating a ton of errors on autofs, after some fooling with it I got it to generate errors on inetd and nfs/client instead. This all tells me that the problem is in some SMF state file or database that needs to be fixed/deleted/recreated or something, but I don’t know what the actual issue is.
By “generate errors”, I mean that every second I get a message on the console saying “Method or service exit timed out. Killing contract <#.” This makes interacting with the computer difficult.
Running svcs –xv shows the service as “enabled”, in state “disabled”, reason “Start method is running”. Fooling with svcadm on the service does nothing, except confirm that the service is not in a Maintenance state.
Logs in /lib/svc/log/$SERVICE just tell you that this loop has been happening once per second. Logs in /etc/svc/volatile/$SERVICE confirm that at boot the service is attempted to start, and immediately stopped, no further entries. Note that system-log isn’t starting because system-log depends on autofs so I have no syslog or dmesg.
Googling all these terms ends up telling me how to debug/fix either autofs or nfs/client or inetd or rpc/gss (which was the dependency that SMF was using as an excuse to prevent nfs/client from “starting”, it was claiming that rpc/gss was “undefined” which is incorrect since this all used to work. I re-enabled it with inetadm, but inetd still won’t start properly). But I think that the problem is SMF in general, not the individual services.
Doing a restore_repository to the “manifest_import” does nothing to improve, or even detectibly change, the situation. I didn’t use a boot backup because the last boot(s) were not useful.
I have told the customer that since the valuable data directories are on a separate file system (which fsck’s as clean so it is intact) we could just re-install solaris 10 on the / partition. But that seems like an awfully windows-like solution to inflict on this problem.
So. Any ideas what piece is broken and how I might fix it?