Been a heckuva week with some server and software issues. I’m on the downhill side of it now though, so I thought I’d share the tale and some lessons learned. I’ve been comfortable with ESXi for a while now, but nothing like things going bad to provide the opportunity to learn so much more!
Our patient was an HP ProLiant DL380 server that runs VMware ESXi as part of our nascent virtualization initiative. It used to be a database server and I’ve been quite pleased with how it handles running the majority of our development and QA servers. The pair of quad-core CPUs rarely breaks a sweat.
Last week I happened to notice that one of the four hard drives in the server’s RAID 1+0 array had a solid amber LED lit. That is indicative of a failure, so I called my friends at HP and they overnighted a replacement drive. I popped into the office Sunday afternoon and hot-swapped the old drive for the new one. I saw it spin up and go green, and it seemed things were on track. I checked an hour or so later and it was still green and busy (presumably rebuilding the array), so I headed out of the office for a family function.
I came in Tuesday morning (2 days later) and was surprised to see the new drive — and another one two bays over — blinking amber. The green activity lights on both drives would sporadically flicker with activity, but the blinking amber indicated a “Predictive Failure” was nigh.
HP wanted me to run diagnostics and send them reports so I scheduled an outage for Tuesday afternoon/evening. I figured that before I ran (potentially destructive) diagnostics I should maybe try to get some copies of the VM clients. See, I tend to treat each VM client just like any other server when it comes to backups. They all run backup software and enjoy nightly data backups. I’ve just never taken advantage of the fact that the VM client is really just a pile of files. And hey, if I lose the array I’d sure rather recover from an image of the server and then restore backups vs. building an all-new server, patching it and then restoring from backups.
Don’t worry: I was shutting each machine down before trying to copy it.
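For what it’s worth, those shutdowns can be scripted from the ESXi console with `vim-cmd` rather than clicked through one at a time. A rough sketch — the vmid values are hypothetical, and the little helper function is mine, not a VMware tool:

```shell
# From the ESXi console (or SSH), vim-cmd can list and power off guests
# before you copy their files. Use whatever Vmids getallvms reports:
#
#   vim-cmd vmsvc/getallvms           # list registered VMs with their Vmids
#   vim-cmd vmsvc/power.shutdown 42   # ask the guest to shut down cleanly
#   vim-cmd vmsvc/power.off 42        # hard power-off if the guest hangs

# Helper: pull the numeric Vmid column out of getallvms-style output
# so the shutdowns can be run in a loop.
list_vmids() {
    awk 'NR > 1 && $1 ~ /^[0-9]+$/ { print $1 }'
}
```

`power.shutdown` needs VMware Tools in the guest to work; `power.off` is the fallback, equivalent to yanking the virtual power cord.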
I fired up the vSphere Client app and ran the built-in Datastore Browser. From there it is pretty simple to copy the machine directories. Simple and painfully slow! While looking for other options, I stumbled over an article about enabling SSH on ESXi servers.
Tip: While on the console of your ESXi server, hit alt + F1, type “unsupported” and hit enter. You won’t see anything echoed while you’re typing, but this will get you into the unsupported “Tech Support Mode” console. Very handy.
Tip: If you inadvertently type “exit” in that console, it appears to shut down. Toggling between alt + F1 and (the default) alt + F2 won’t help. Instead, while on the “dead” console, hit enter a bunch of times. I don’t know how many, but more than 3… then type “unsupported” and hit enter again. Back in business!
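As for actually turning SSH on from Tech Support Mode: on the ESXi builds of that era the dropbear SSH entry in /etc/inetd.conf ships commented out. The exact line varies by build, so treat the following as an assumption and check your own inetd.conf rather than pasting it blindly:

```
# /etc/inetd.conf -- remove the leading '#' from the ssh line, e.g. change:
#ssh  stream  tcp  nowait  root  /sbin/dropbearmulti  dropbear ++min=0,swap,group=shell -i
# to:
ssh   stream  tcp  nowait  root  /sbin/dropbearmulti  dropbear ++min=0,swap,group=shell -i

# Then restart inetd so it rereads the config (send its PID a HUP, or reboot).
```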
Now that I had SSH working I fired up trusty WinSCP and tried pulling the files with it. Hmm, nope. That’s not any faster.
Tip: If using WinSCP, change the encryption cipher to Blowfish instead of the default AES for a bit of performance boost.
Some more digging turned up the fact that you can run an FTP server on the ESXi box. So, starting to panic about progress, I gave that a shot next. Definitely faster, but I wasn’t getting complete files. For one machine I might get a 20 GB .vmdk file, but then for the next I’d only get 9 out of 100 GB. It was inconsistent and frustrating since in some cases it would take an hour before crapping out. I tried watching the message log but it was so full of I/O errors that it was hardly worth using. At times it was scrolling by faster than I could read it!
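In hindsight, a quick integrity check after each transfer would have flagged the short copies right away instead of an hour in. A minimal sketch — the function name and the idea of running it are mine, not part of any VMware tooling; md5sum is available in the ESXi busybox shell, so you can sum the source file there and compare against your local copy:

```shell
#!/bin/sh
# Sketch: confirm a copied .vmdk actually matches the original.
# Run md5sum on the ESXi side for the source file, then compare
# the two sums (or, at minimum, the two file sizes) locally.

verify_copy() {
    src="$1"    # original file (or a local stand-in)
    dst="$2"    # the copy to check
    if [ "$(md5sum "$src" | cut -d' ' -f1)" = "$(md5sum "$dst" | cut -d' ' -f1)" ]; then
        echo "OK: $dst"
    else
        echo "MISMATCH: $dst"
    fi
}
```

Even just comparing `ls -l` sizes on both ends would have caught that 9-of-100 GB truncation immediately, without waiting for a 100 GB checksum to grind through a degraded array.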
Tip: Hit alt + F12 on the console to get to the ESXi “live” message log. You can scroll around all you want to examine errors, but be aware that doing so stops the updates. Hit the spacebar when you’re ready for it to resume updating.
I finally had to give up on copying server images. I just couldn’t get good copies and I’m sure it was related to the degraded drives (at least, I hope so!). I realized I was going to have to gut it out with the existing backups.
To run diags I had to boot the server from an HP SmartStart CD. While looking for a current version of that CD I stumbled over an update released last month: ** Critical ** Firmware CD Supplemental Online ROM Flash Component for Linux – Smart Array P400 and P400i. Huh. Critical, eh? I grabbed that update and added it to my bootable USB key with the Smart Update Firmware DVD image.
Once booted from SmartStart, I went into Maintenance mode and collected the array diag reports, then went into Diagnostics and ran the complete diagnostics. The HP tech wanted me to run the complete suite 5 times… heck, it took over 2 hours to do it twice. I called that good enough. I saved every report option and then called HP back.
They reviewed the reports and agreed things did indeed look broke. Both drives were logging tons of hard read and write faults, and S.M.A.R.T. was having fits (thus the predicted failure…). They also agreed that I should do that critical firmware flash. I did. They deemed that good enough and ordered me a pair of drives and a backplane card just in case it was a hardware fault (hey, I won’t turn down extra hardware!).
Now, I don’t know if it was the reboots post hot-swap or that firmware update, but when I brought the server back up I sure had a LOT fewer I/O errors in the ESXi log, so that was progress. However, I still had two unhappy drives.
While waiting for the parts to show up I’d have a day to continue trying to get better client images.
(to be continued)