Wednesday, August 27, 2008

Computers - unpredictable creatures

Computers are unpredictable beasts. You would think they would be more deterministic, but reality is otherwise.

I have a server with a tape drive. We've used it most days for about a year. Then suddenly we started getting errors. At first we thought it was a bad tape, but then multiple tapes started giving us grief. Easy enough, we thought: just use a different drive. I finally got around to debugging it last week. Swapped the drives over - still errors. It turned out to be a bad cable. That's a new one - I've not seen a SCSI cable fail like that before. (Usually they fail straight away or when you change something, not after working stably and untouched for the best part of a year.)

Yesterday I set up SNMP on some machines for monitoring purposes. I pointed the monitoring system at them, and a couple of minutes later two of them stopped responding. That wasn't part of the plan. So I went to the LOM interface, and they were powered off. I called the datacenter; they hadn't done anything. I have seen strange things, but SNMP (running unprivileged, I might add) powering a machine off when queried? So I told them to power themselves back on. One came up fine; the other booted, but with no ZFS filesystems or zones. I tried format. No SAN disks. And then:
# fcinfo hba-port
No Adapters Found.
Yikes! It had a couple of Fibre Channel HBAs in it a few minutes ago.

I still don't know exactly what happened, but some electrical gremlins had gotten into the works. The machines had presumably shut themselves off due to lack of power. And I'm guessing that the PSUs were capable of supplying just enough power to boot the machine, but not enough to get the HBAs powered up properly. Another new failure mode to go in the book.
