Read Errors
The final (?) note on the ESXi / HP saga (part 1 / part 2). This is too much of a downer to continue documenting!
I’ll start with a quick tip: if moving data to NFS shares seems slow or gives you frequent timeouts, look to your network gear.
I was having trouble getting ghettoVCB backups onto an NFS share on a Windows 2003 server. The VMware ESXi server would sporadically lose its connection to the 2k3 server, and that killed the backup. I finally replaced the little D-Link SOHO gigabit switch with an HP ProCurve and then replaced all the sketchy old network cables with shiny new Cat 6 cables. The backups became noticeably faster and the intermittent connection losses completely disappeared.
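For anyone chasing similar drops, here is a minimal Python sketch of one way to watch for intermittent connection loss from a third machine. It is not something used in this saga, just an illustration. It assumes the NFS server accepts TCP connections on port 2049 (the usual NFS port), and the `probe` and `watch` names are made up for this example. A successful TCP connect only shows the server is reachable; it says nothing about whether NFS itself is healthy.

```python
import socket
import time


def probe(host, port=2049, timeout=3.0):
    """Return the TCP connect time in seconds, or None if the connect fails."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None


def watch(host, port=2049, attempts=10, interval=1.0, timeout=3.0):
    """Probe repeatedly and return a list of (attempt, seconds_or_None)."""
    results = []
    for attempt in range(1, attempts + 1):
        results.append((attempt, probe(host, port, timeout)))
        if attempt < attempts:
            time.sleep(interval)
    return results
```

Calling something like `watch("nfs-server", attempts=60)` (with your own server name in place of the placeholder) while a backup runs and looking for `None` entries, or connect times that jump around, is a cheap first check before blaming the storage.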
Now I can get good backups for 3 out of the 6 Virtual Machines (VMs) on this server. Using any sort of file copy I can get copies of those same 3.
The other three? I’m starting to lose faith – I think we’re hosed. The copies or backups always end with a series of errors. The log errors point to the datastore where the VMs currently reside, not the copy destination. Read errors. Ugh.
Tip: I can’t seem to get “thin” backups to the Windows (or OpenFiler, for that matter) NFS shares. So, regardless of how much data is actually used in that 150 GB virtual disk, I get a full 150 GB backup file. As a workaround, I turned on NTFS compression for the NFS share on the Windows server. It slows the copy speed down by almost half, with barely any extra CPU utilization. Worth it though, as it took 280 GB of backups down to 16.5 GB!
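To put that last number in perspective, here is the arithmetic as a tiny Python sketch. The function name is made up for illustration; the 280 GB and 16.5 GB figures are the ones quoted above.

```python
def compression_summary(original_gb, compressed_gb):
    """Return (compression ratio, percent of space saved)."""
    ratio = original_gb / compressed_gb
    saved_pct = (1 - compressed_gb / original_gb) * 100
    return ratio, saved_pct


ratio, saved_pct = compression_summary(280, 16.5)
print(f"{ratio:.1f}:1, {saved_pct:.1f}% saved")  # prints "17.0:1, 94.1% saved"
```

A ratio that high suggests the full-size backup files were mostly empty space, which is what you would expect from thick copies of lightly used virtual disks.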
I have a VMware forum post out there languishing. It did result in me making sure I had the latest/greatest firmware, ESXi updates and HP tools installed though. It also took me down a few unnecessary paths, but that’s OK as it was educational. I’ll probably close that post soon and try a much shorter and summarized version. I may have to figure out how to contact paid support.
I also tried a ServerFault.com post but I think I tried to cover too much territory in it. Face it, many geeks suffer from tl;dr syndrome. I think I’ll close that topic soon as well.
I am trying to get some help from HP now, but this time they’re not so interested in helping. See, at boot time the P400 array controller gives an error 1716 “unrecoverable media error.” HP says, logically enough, that I need to rebuild the array. OK, I’d like to do that but I want image backups first. They say I should’ve had good backups before I did any drive replacements. Well, that’s a good point. I did… but that was almost two weeks ago! *cough* excuse me… I’d just like fresh backups before I toast the array. These machines are all still in use and the company hasn’t been standing still.
There doesn’t appear to be a chkdsk or fsck for VMFS-formatted volumes. Seems like that would be useful.
For web searchers dying to share a cure, below I’ve listed some of the error messages.
GhettoVCB errors are:
- Failed to clone disk : Connection timed out (7208969)
- Failed to clone disk : Input/output error (327689)
Sample message log errors:
Jul 15 19:18:50 vmkernel: 0:17:59:18.890 cpu4:16218)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100051614c0) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 15 19:18:50 vmkernel: 0:17:59:18.890 cpu4:16218)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...
Jul 15 19:18:50 vmkernel: 0:17:59:18.890 cpu4:16218)ScsiDeviceIO: 770: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x2 0x3a 0x0.
Jul 15 19:18:53 vmkernel: 0:17:59:22.500 cpu4:5365)<4>cciss: cmd 0x4100b1402000 has CHECK CONDITION byte 2 = 0x3
Here’s another set:
Jul 15 19:49:08 vmkernel: 0:18:29:37.553 cpu7:10409)<4>cciss: cmd 0x4100b1402000 has CHECK CONDITION byte 2 = 0x3
Jul 15 19:49:08 vmkernel: 0:18:29:37.559 cpu7:10409)NMP: nmp_CompleteCommandForPath: Command 0x28 (0x4100050fa000) to NMP device "mpx.vmhba1:C0:T1:L0" failed on physical path "vmhba1:C0:T1:L0" H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jul 15 19:49:08 vmkernel: 0:18:29:37.559 cpu7:10409)WARNING: NMP: nmp_DeviceRequestFastDeviceProbe: NMP device "mpx.vmhba1:C0:T1:L0" state in doubt; requested fast path state update...
Jul 15 19:49:08 vmkernel: 0:18:29:37.559 cpu7:10409)ScsiDeviceIO: 770: Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
Jul 15 19:49:08 vmkernel: 0:18:29:37.559 cpu6:312122)Fil3: 5354: Sync READ error ('EFops_wsus-flat.vmdk') (ioFlags:
: Timeout
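When there are hundreds of these lines, it helps to boil them down. Below is a minimal Python sketch, not an official VMware tool, that tallies the "Command ... failed" lines by device, SCSI opcode, H/D/P status values and sense data. The regular expression is written against the line format quoted above and is an assumption about how other ESXi builds log the same thing; the function name is made up. Opcode 0x28 is the standard SCSI READ(10) command, which fits the read-error theory.

```python
import re
from collections import Counter

# Pattern for the NMP / ScsiDeviceIO failure lines quoted in this post.
SCSI_FAILURE = re.compile(
    r'Command (?P<opcode>0x[0-9a-fA-F]+) .*?device "(?P<device>[^"]+)" .*?'
    r'H:(?P<host>0x[0-9a-fA-F]+) D:(?P<dev>0x[0-9a-fA-F]+) '
    r'P:(?P<plugin>0x[0-9a-fA-F]+) '
    r'Possible sense data: '
    r'(?P<sense>0x[0-9a-fA-F]+ 0x[0-9a-fA-F]+ 0x[0-9a-fA-F]+)'
)


def tally_scsi_failures(lines):
    """Count matching log lines per (device, opcode, H, D, P, sense)."""
    counts = Counter()
    for line in lines:
        match = SCSI_FAILURE.search(line)
        if match:
            key = (
                match.group("device"),
                match.group("opcode"),
                match.group("host"),
                match.group("dev"),
                match.group("plugin"),
                match.group("sense"),
            )
            counts[key] += 1
    return counts


# Two lines copied from the first sample above: one matches, one does not.
sample = [
    'Jul 15 19:18:50 vmkernel: 0:17:59:18.890 cpu4:16218)ScsiDeviceIO: 770: '
    'Command 0x28 to device "mpx.vmhba1:C0:T1:L0" failed H:0x3 D:0x0 P:0x0 '
    'Possible sense data: 0x2 0x3a 0x0.',
    'Jul 15 19:18:53 vmkernel: 0:17:59:22.500 cpu4:5365)<4>cciss: '
    'cmd 0x4100b1402000 has CHECK CONDITION byte 2 = 0x3',
]

for key, count in tally_scsi_failures(sample).items():
    print(count, key)
```

Note that in the samples the NMP and ScsiDeviceIO layers appear to log the same failed command once each, so the tallies are counts of log lines rather than of distinct commands.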
Unless I can figure out a way to get those last 3 VMs imaged or copied – or find an alternative way to fix the read errors – I see a long weekend rebuilding machines in my future. Fortunately I can still get all the data from the VMs. I just can’t copy the VMs directly!
Wrestling with ESXi Client Backups
Continued from Fun with ESXi and HP issues. Learning Lessons.
To recap, I had two new drives and a backplane en route and really wanted to get complete image backups of the VM clients running on this VMware ESXi server before I started swapping drives. I’d tried the vSphere built-in file copy utility, FTP and SCP but all had timeout issues on copying the large disk image (.vmdk) files.
While looking for more backup options I recalled reading about ghettoVCB in the past and this seemed like a great time to check it out. This is a very well written – free! – script that you can use to back up ESX(i) clients’ files to direct attached storage (which I’m avoiding for obvious reasons!) as well as NFS or iSCSI volumes. While I don’t have anything currently offering NFS or iSCSI drives I figured I could whip something together easily enough.
First off, I was looking at one of my Windows 2003 file servers. This machine has tons of unused storage and an unused network card so I figured it would be a good place to start. The unused network card is a nice way to keep all the NFS/iSCSI traffic off of the main network.
Hey, did you know that Windows Services for NFS is built into Windows 2003 and 2008 Server versions? I didn’t either until last week. Seemed worth a shot so I checked the box on the Windows Component Installer and gave it a try.
Tip: When prompted for Windows media, pay attention to the differences between the 32-bit and x64 versions. I had no issues installing this component from 32-bit media – alas, the server is x64 and the service definitely didn’t start! Took an uninstall and a reboot to get back on track…
Once I sorted through some “user issues” I was ready to try connecting the ESXi server to the Windows exported NFS share. Sadly, no matter what I tried I couldn’t seem to hit the right combination of options to get this to work.
A few days later I noticed that checking “Microsoft Services for NFS” doesn’t give you all of the NFS service options… In particular, you don’t get the options around authentication and user mapping. This could very well explain why I couldn’t get it working. I will be revisiting this in the near future. I guess the other way to do this is via Windows Services for Unix, something I have very little experience with but might check into.
Time being of the essence, I set aside the Windows option and moved on to try with my OpenFiler server (first mentioned back in December of ’09). Enabling the secondary network card and then configuring NFS on it took about 5 minutes. Connecting the ESXi server to it took about another minute. Easy!
I configured ghettoVCB and gave it a try. The first few machines backed up nicely! Alas, the next one had issues – a few file copy restarts and then it finally gave up after a timeout error.
Side note: My OpenFiler server is running an older pair of Hyper-Threaded Xeon 3.0 GHz processors. When the backup to NFS is running it is seriously bogged down. That caught me by surprise.
After farting around with backups for (literally) over 12 hours I finally accepted the fact that I wasn’t going to get them all backed up and went ahead and did the hardware swaps. I figured perhaps the drives being in a degraded state was causing my issues.
Nope.
All the hardware has been replaced. All drives happy, RAID arrays rebuilt and everything is “green” and good to go on the server. I still can’t get reliable ghettoVCB backups and, perhaps related, I still can’t get full copies of the images using vSphere, FTP or SCP either.
At the moment I’m thinking this may be more of an ESXi issue since I can’t get good copies with or without NFS in the mix. Or maybe I need to try iSCSI from the OpenFiler server? I’ll need to spend more time on this one as I think ghettoVCB backups would be invaluable – once I can get all my stuff working.




