NetApp autosupport rocks
It’s no secret that I love a good many things about NetApp. One of the absolute top reasons I continue to buy and recommend NetApp filers is their fantastic tech support, in particular the “AutoSupport” system built into the filer. Here’s an example.
- 6:27PM - I racked a full disk shelf into the cabinet of an existing filer. I hooked everything up (fibre channel connections, power), set the shelf ID (5, in this case) and made sure termination was on for the shelf. Powered the shelf up and confirmed that all status lights were green.
- 6:29PM - I flipped termination on the previous last shelf (ID 4) off. The fibre loops now extended up to the new shelf and the filer instantly picked up the additional 14 disks and built the redundant loop paths to them. The filer also, however, checked the status of each disk and discovered that a bunch of the disks were previously in another “aggregate” (NetApp speak for a pool of disks protected by the same RAID layout). My filer knew this was a foreign aggregate (i.e. it was created on another filer), and although it confirmed that all the disks required by the aggregate were available on the shelf, it put the aggregate into an offline state to protect what could be valuable data, and generated an error message.
Now, you might be thinking “Well, that’s nice. You added the shelf live and the machine actually did some intelligent things instead of blindly making the disks available for use and potentially destroying a bunch of data. Oh, and you got a nice error message.” And you would be correct. It is nice.
But the filer didn’t just generate an error on the console. It also sent the Systems team an email outlining the problem: “Hey! I just discovered a pile-o-disks that belong to another system! You now have a volume here that’s offline!” It also sent an HTTP POST to open a trouble ticket in our internal issue tracker. And it automatically opened a support case with NetApp detailing the error (also via an HTTPS POST). Which leads me to the next event in the timeline:
- 6:37PM - My phone rings and NetApp informs me that they’ve received an autosupport from the filer indicating “RAID VOLUME FAILED”. I confirm that I’ve received the email as well. He says “It looks like you’ve added a shelf, is that correct?” Why yes, yes it is. He then says “OK, just making sure. Do you know how to online a foreign aggregate, or are you planning on using those disks to be part of an existing volume?” I confirm that I’m going to blow the disks away and buddy makes sure I know the “aggr destroy” command. Next question from NetApp guy: “From the autosupport it looks like one of the ESH modules had old firmware. The system tried to upgrade it, but failed. I’ve opened an RMA. Do you want the part delivered immediately, or can we send it over in the morning?”
Now, ignoring my failure to spot the second autosupport message indicating the shelf firmware upgrade attempt had failed (my brain must have written it off as being another message about the offline volume...), is that awesome or what?!? Not only did I get immediate over-the-phone confirmation of the problem, which made me feel much more secure about issuing the commands to reclaim the disks, but it caught a problem that I hadn’t even picked up on.
This is so much smarter behaviour than my IBM FAStT or Sun / Dot Hill storage it’s not even funny. And it’s just the tip of the iceberg of why dealing with “smarter” NetApp filers is soooo much better than the braindead products of many of their competitors. Ever since I first saw autosupport in action I wondered how long it would be before the other vendors took the same approach. Here we are, a decade later, and I continue to be well served by autosupport and continue to be astonished that it’s still largely a NetApp exclusive.
Bottom line: NetApp autosupport rocks, and saves my ass yet again.