Fun with ESXi and HP issues. Learning Lessons

08 Jul 2010

Been a heckuva week with some server and software issues. I’m on the downhill side of it now, though, so I thought I’d share the tale and some lessons learned. I’ve been comfortable with ESXi for a while now, but there’s nothing like things going bad to provide the opportunity to learn so much more!

Our patient was an HP ProLiant DL380 server that runs VMware ESXi as part of our nascent virtualization initiative. It used to be a database server and I’ve been quite pleased with how it handles running the majority of our development and QA servers. The pair of quad core CPUs rarely break a sweat.

Last week I happened to notice one of the hard drives (4 of which are in a RAID 1+0 array) had a solid amber LED lit. That is indicative of a failure so I called my friends at HP and they over-nighted a drive. I popped into the office Sunday afternoon and hot-swapped the old drive for the new one. Saw it spin up and go green and it seemed things were on track. I checked an hour or so later and it was still green and busy (presumably rebuilding the array) so I headed out of the office for a family function.

I came in Tuesday morning (2 days later) and was surprised to see the new drive, and another one two bays over, blinking amber. The green activity lights on both drives would sporadically flicker with activity, but the blinking amber indicated that a “Predictive Failure” was nigh.

HP wanted me to run diagnostics and send them reports so I scheduled an outage for Tuesday afternoon/evening. I figured that before I ran (potentially destructive) diagnostics I should maybe try to get some copies of the VM clients. See, I tend to treat each VM client just like any other server when it comes to backups. They all run backup software and enjoy nightly data backups. I’ve just never taken advantage of the fact that the VM client is really just a pile of files. And hey, if I lose the array I’d sure rather recover from an image of the server and then restore backups vs. building an all new server, patching it and then restoring from backups.

Don’t worry: I was shutting each machine down before trying to copy it.

I fired up the vSphere Client app and ran the built-in Datastore Browser. From there it is pretty simple to copy the machine directories. Simple and painfully slow! While looking for other options I stumbled over an article about enabling SSH on ESXi servers.

Tip: While on the console of your ESXi server hit alt + F1, type “unsupported” and hit enter. You won’t see what you’re typing, but this will get you into the unsupported “Tech Support Mode” console. Very handy.

Tip: If you inadvertently type “exit” in that console it appears to shut down. Toggling between alt + F1 and (the default) alt + F2 won’t help. Instead, while on the “dead” console, hit a bunch of enters. I don’t know how many, but more than 3… then type “unsupported” and hit enter again. Back in business!

Now that I had SSH working I fired up trusty WinSCP and tried pulling the files with it. Hmm, nope. That’s not any faster.

Tip: If using WinSCP, change the encryption cipher to Blowfish instead of the default AES for a bit of performance boost.

Some more digging turned up the fact that you can run an FTP server on the ESXi box. So, starting to panic about progress, I gave that a shot next. Definitely faster, but I wasn’t getting complete files. For one machine I might get a 20 GB .vmdk file, but then for the next I’d only get 9 out of 100 GB. It was inconsistent and frustrating since in some cases it would take an hour before crapping out. I tried watching the message log but it was so full of I/O errors that it was hardly worth using. At times it was scrolling by faster than I could read it!
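
A truncated copy like that 9-out-of-100 GB one is easy to catch after the fact by comparing the local file against what the source says it should be. Here’s a minimal Python sketch of that check. The function names (file_digest, verify_copy) are made up for illustration, and it assumes you can get the expected byte count (and, optionally, a SHA-256 hash) from the source side by some other means, such as a directory listing on the host. It says nothing about why a copy failed, only whether it is complete.

```python
import hashlib
import os


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks
    so that a multi-gigabyte disk image never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_copy(local_path, expected_size, expected_sha256=None):
    """Check a copied file against the size (and optionally the hash)
    reported for the source. Returns (ok, message)."""
    actual_size = os.path.getsize(local_path)
    if actual_size != expected_size:
        # Cheap check first: a short file is the failure described above.
        return False, f"size mismatch: got {actual_size} of {expected_size} bytes"
    if expected_sha256 is not None and file_digest(local_path) != expected_sha256:
        # Same length but different content, e.g. corrupted blocks.
        return False, "checksum mismatch"
    return True, "ok"
```

The size comparison runs first because it is nearly free. The hash is optional since it means reading the whole file again, which is slow for large disk images.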

Tip: Hit alt + F12 on the console to get to the ESXi “live” message log. You can scroll around all you want to examine errors but be aware that will stop the updates. Hit the spacebar when ready for it to resume updating.

I finally had to give up on copying server images. I just couldn’t get good copies and I’m sure it was related to the degraded drives (at least, I hope so!). I realized I was going to have to gut it out with the existing backups.

To run diags I had to boot the server from an HP SmartStart CD. While looking for a current version of that CD I stumbled over an update released last month: ** Critical ** Firmware CD Supplemental Online ROM Flash Component for Linux – Smart Array P400 and P400i. Huh. Critical, eh? I grabbed that update and added it to my bootable USB key with the Smart Update Firmware DVD image.

Once booted from SmartStart I went into Maintenance mode and collected the array diag reports, then went to Diagnostics and ran the complete diagnostics. The HP tech wanted me to run the complete set 5 times… heck, it took over 2 hours to do it twice. I called that good enough, saved every report option and then called HP back.

They reviewed the reports and agreed things did indeed look broke. Both drives were logging tons of hard read and write faults and S.M.A.R.T. was having fits (thus the predicted failure…). They also agreed that I should do that critical firmware flash. I did. They deemed that good enough and ordered me a pair of drives and a backplane card just in case it was a hardware fault (hey, I won’t turn down extra hardware!).

Now, I don’t know if it was the reboots post hot-swap or that firmware update, but when I brought the server back up I sure had a LOT fewer I/O errors in the ESXi log, so that was progress. However, I still had two unhappy drives.

While waiting for the parts to show up I’d have a day to continue trying to get better client images.

(to be continued)

WhatsUp Gold Engineer’s Toolkit

30 Jun 2010

The folks at Ipswitch released their WhatsUp Gold Engineer’s Toolkit this week. I was in the beta for the past month and find it a pretty handy ‘kit.

What I hadn’t realized was that they’d be giving it away for free when launched. Nice!

So hey, if you do any network work go get yourself a copy. It won’t replace any single specialized tool (like nmap) but it certainly bundles a lot of functionality in one application.

Here’s what you get:

  • Design & Planning
    • Subnet Calculator
  • Discovery
    • Ping Sweep
    • Port Scanner
    • MAC Address Discovery
  • Diagnostics
    • Ping
    • Trace Route
    • WAN Load Generator
    • Spam Blacklist
    • SNMP Grapher
  • DNS Verification
    • DNS Audit
    • DNS & Whois Resolver
    • DNS Analyzer
  • Remote Control
    • Wake On LAN
    • Remote TCP Session Reset

Each tool opens up in a separate tab so you can have multiple things going at the same time which is handy.
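
For anyone wondering what a subnet calculator actually works out, here’s a rough sketch in Python using the standard ipaddress module. This is not how the Toolkit does it (that isn’t documented here), and the describe_subnet name and the sample address are made up for illustration.

```python
import ipaddress


def describe_subnet(cidr):
    """Summarize the network an address/prefix belongs to."""
    # strict=False lets us pass a host address such as 192.168.10.37/26
    # and get back the network that contains it.
    network = ipaddress.ip_network(cidr, strict=False)
    if network.num_addresses > 2:
        # Ordinary case: the first and last addresses are reserved
        # for the network and broadcast addresses.
        first = network.network_address + 1
        last = network.broadcast_address - 1
        usable = network.num_addresses - 2
    else:
        # /31 and /32 style networks: every address counts as usable.
        first = network.network_address
        last = network.broadcast_address
        usable = network.num_addresses
    return {
        "network": str(network.network_address),
        "netmask": str(network.netmask),
        "broadcast": str(network.broadcast_address),
        "first_host": str(first),
        "last_host": str(last),
        "usable_hosts": usable,
    }


info = describe_subnet("192.168.10.37/26")
print(info["network"], info["netmask"], info["usable_hosts"])
# prints: 192.168.10.0 255.255.255.192 62
```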

As a network engineer, at least part of your average workday is spent diagnosing and troubleshooting existing problems or investigating issues that could cause new problems. These activities often involve the tedious process of accessing individual network elements to gather information on device or subnet configuration and availability; interpreting that data; and then accessing the same elements again to provision or configure devices and services.

Most of the time you don’t need a powerhouse application to support these activities. In fact, many network engineers use a mix of tools they’ve cobbled together and then rely on brainpower and intuition to make up the difference. Although this strategy works, there is an easier way to get the job done.

OpenDNS FamilyShield – Easy Mode

24 Jun 2010

The OpenDNS folks made a pretty big announcement yesterday when they added “FamilyShield” to their lineup. This is like “easy mode” for home computer protection and is a pretty exciting product.

If you are already an OpenDNS user, this probably isn’t for you. But if you have friends or family that aren’t all that technical and need help this could be just the ticket. They can keep little Johnny or Suzy safe (or grandpa…) and not have to worry about software to buy and manage. Just set up the home router/firewall with the FamilyShield DNS server addresses and you’re ready to go. No cost, no account, no configuration, no maintenance. It all just works.

What does FamilyShield Block? The service blocks pornographic content, including our “Pornography,” “Tasteless,” and “Sexuality” categories, in addition to proxies and anonymizers (which can render filtering useless). It also blocks phishing and some malware.

For the more technical folks and tweakers, be aware that there’s no customization or tweaking either. For the average family that’s probably not going to be a big deal. If it is a big deal then they can just step up to the “normal” free OpenDNS offering and get a few more options.

I think this is a great idea. I have a few family members that I will be switching over to this very soon.

Learn more at the OpenDNS FamilyShield product page.

A Quick Look at Flock 2.0 Beta

18 Jun 2010

The latest beta of Flock’s “social browser” is out now and I thought I’d give it a look this week. I’ve checked it out in the past but generally been pretty underwhelmed… In fact, I don’t think I’ve ever written about it before. However, this latest version is based on Chromium, which is the same code running Google’s Chrome browser, and I’m rather bullish on Chrome.

Here’s the pitch:

Flock is faster, simpler, and more friendly. Literally. It’s the only sleek, modern web browser with the built-in ability to keep you up-to-date with your Facebook and Twitter friends.

The install is simple, as is the initial startup experience. I created a Flock account (seemed to be necessary to use the social features) and then pointed it towards my Twitter and Facebook accounts. The browser started and I have this nifty little sidebar on the left that has all the activity from both services mixed in.

Using the browser is, for the most part, just like using Google Chrome with slightly different fonts and colors. Basically, this seems like another Chromium port with a sidebar so I looked a little closer at the sidebar.

I don’t love it.

OK, I don’t hate it, but it is missing a few features that really bug me.

  1. No Twitter Groups support. The Flock sidebar does support groups of its own and offers some really cool features. But there’s no access to your pre-existing Twitter groups.
  2. No Facebook comments. Yes, you see Facebook status updates from your friends, but you don’t see any comments or “likes” that might be on those updates. You have to click them (one by one) to see if anyone commented. Ugh.

Odd design decision? The Twitter stream shows everything, including replies where you don’t know both parties. Old school! I’m on the fence about whether that’s good or bad.

If you really like Chrome and want some social integration, then give Flock 2.0 beta a look – unless you’re big into setting up Twitter groups! I’m going to hold off for now and see what subsequent betas bring to the table. Perhaps there’s more to come that will make me a bit more excited to switch over.