It was supposed to be a routine ESXi upgrade. We had discovered that the OnTap software on one of the NetApp controllers had a memory leak, which was causing the controller to slow down and eventually fail over every three to six months. A prerequisite of updating the OnTap software was to upgrade the ESXi hosts to at least version 5.
Having done a number of these upgrades recently, I was confident as I submitted the change requests, notified stakeholders and completed the upgrade.
And then the phone calls started.
“The servers are unresponsive”. Indeed they were. Not all of them, though, just some of them, and only for short periods of up to five minutes. They weren't crashing, but they weren't working either.
Chasing a red herring, we assumed the ESXi upgrade had increased the severity of the known memory leak and started troubleshooting, trying to ease the load on the affected controller. It wasn't until later in the day that the true cause was found.
VMware KB Article 2016122 turned out to describe the root cause. It is a bug in the OnTap software and, you guessed it, can be resolved by upgrading the OnTap software. At this point we were stuck: either cause a major outage by bringing forward the OnTap software upgrade, which we were told we could not do, or roll back to ESXi 4.1, which would mean we would be unable to upgrade the OnTap software at all.
Thankfully there is also a workaround: reducing the NFS MaxQueueDepth setting on the hosts to throttle the I/O. The PureStorageGuy has an excellent article about this.
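For reference, the workaround can be applied per host with esxcli. A sketch, assuming ESXi 5.x and that 64 is an acceptable queue depth for your environment (tune to your workload):

```shell
# Check the current value of the NFS queue depth advanced setting
esxcli system settings advanced list -o /NFS/MaxQueueDepth

# Reduce the queue depth to throttle I/O to NFS datastores
# (64 is a commonly used starting value, not a universal recommendation)
esxcli system settings advanced set -o /NFS/MaxQueueDepth -i 64
```

The host typically needs a reboot for the change to take effect, so plan to roll it through the cluster one host at a time.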
A number of factors combined to create the perfect storm:
- Upgrading ESXi instead of a clean install.
- The version of the NetApp OnTap Software.
- The protocol to access the Datastore.
In all, it turned a routine upgrade into a bit of a nightmare, and I'm fairly sure I gained a few more grey hairs as a result.