{"id":105,"date":"2013-04-13T19:04:02","date_gmt":"2013-04-14T00:04:02","guid":{"rendered":"http:\/\/www.unfortunateshit.com\/?p=105"},"modified":"2013-04-13T19:04:02","modified_gmt":"2013-04-14T00:04:02","slug":"storage-saga","status":"publish","type":"post","link":"https:\/\/www.unfortunateshit.com\/?p=105","title":{"rendered":"Storage saga"},"content":{"rendered":"<p>A few weeks ago, I noticed that my ZFS array was resilvering, due to an HD failure.\u00a0 This is the first time a drive has failed in my ZFS array, which is a little over 2 years old.\u00a0 My ZFS pool was 99.7% full, an issue I&#8217;ve been meaning to deal with for quite some time now, but I&#8217;ve had other priorities.\u00a0 As a result, a resilver (rebuild\/resync in RAID terms) was causing quite a bit of thrashing on the disks.<\/p>\n<p>A bit of background:\u00a0 In early 2011, I built the successor to my 20HD 30TB Raid6 array.\u00a0 It is a 24HD 48TB ZFS RaidZ2 array.\u00a0 RaidZ2 is similar to Raid6, in that two drives&#8217; worth of capacity is used for parity rather than storage.\u00a0 This means that two drives can fail without losing any data.\u00a0 I knowingly went outside best practices while building it, and put 24 drives in one vdev.\u00a0 Actually 23 drives in the vdev, and one hot spare.\u00a0 A vdev is a group of drives within a ZFS pool.\u00a0 Parity is localized to a vdev, and you can have multiple vdevs within a pool (with more parity drives in each vdev).\u00a0 You are only supposed to have up to 9 drives in a RaidZ2 vdev.\u00a0 At the time, it seemed like the only disadvantage to having that many drives in a vdev was performance, which was not a huge factor for me.\u00a0 A RaidZ vdev only delivers roughly the random IOPS of a single drive, so your vdev is only as fast as your slowest drive.\u00a0 What I didn&#8217;t consider was the amount of stress on the drives, and the overall time it would take for a resilver to complete.\u00a0 Especially when your pool is 99.7% full, and it has to move data around in little tiny 
chunks.<\/p>\n<p>So back to the drive failure.\u00a0 This should be no big deal.\u00a0 I had a hot spare, which is why the array was resilvering itself, without any intervention from me.\u00a0 By the time I noticed it, the resilver was about 3 hours in.\u00a0 The ETA to complete the resilver was 72 hours from then!\u00a0 That&#8217;s 72 hours of continuous hard disk thrashing, in addition to the normal load caused by the r\/w of the 22 VMs I have running on that array.\u00a0 I shut some of my VMs down, to hopefully speed up the process, and checked back in a few hours.\u00a0 To my horror and dismay, the hot spare failed, and 4 other drives had taken IO errors (and were being resilvered as a result).\u00a0 The array continued to resilver, across the remaining drives, and was still thrashing like hell.\u00a0 Several hours later, 3 more drives had taken IO errors.\u00a0 That&#8217;s 8 drives resilvering, and 2 faulted in the pool.\u00a0 It doesn&#8217;t get much worse than this.\u00a0 I always buy my drives from multiple sources when I build a NAS, so that they will have different dates of manufacture, and are less likely to fail in huge batches.\u00a0 What the hell was going on?\u00a0 All I could figure was that the incredible stress of the resilver was too much for my consumer-grade HDs to handle.<\/p>\n<p>About 10pm that night, it happened.\u00a0 A third drive failed, less than 18 hours into the resilver.\u00a0 Since one of the three was the hot spare, technically only two drives from the original pool had failed, which is the maximum allowed without losing data.\u00a0 At this point I am shitting bricks, and literally can&#8217;t sleep.\u00a0 I shut all my VMs down, and am scrambling to move critical data to other disks in the house.\u00a0 I lit up my old array, which hadn&#8217;t been powered on since we moved into the new house.\u00a0 It wouldn&#8217;t boot!\u00a0 Something was up with the OS on the boot drive, so I booted off an Ubuntu Live CD and mounted the 
array.\u00a0 All was fine, but the data (which was originally a backup of what was on the new array) was quite stale.\u00a0 Since 48TB &gt; 30TB, I obviously had to decide what I was willing to lose, and only copy some stuff over.\u00a0 I started using external USB drives, and my desktop machine, as temporary storage to move data to, in case another drive failed.\u00a0 The next morning my wife says, &#8220;Dave, is something supposed to be beeping in the furnace room?&#8221;\u00a0 This can only mean one thing.\u00a0 A drive failed on my old array (which has an enterprise RAID card in it, and notifies you when a drive fails).\u00a0 What else could go wrong?\u00a0 Since my VMs were shut down, I did not get an email notification.\u00a0 I hopped on the console and noticed that drive 9 had failed, and the array was rebuilding with drive 10 (the hot spare).\u00a0 The ETA on this rebuild was much shorter: 10 hours.\u00a0 It completes without incident, and I swap the bad HD with a cold spare that I had on hand.<\/p>\n<p>Eventually, the resilver completes, in 73 hours.\u00a0 No more drives failed, and I haven&#8217;t lost any data.\u00a0 I&#8217;m relieved, but still incredibly spooked that I could lose it all at any minute if another drive failed.\u00a0 Throughout all of this, I&#8217;ve been trying to figure out what my long-term plan was going to be.\u00a0 Up until this all happened, I had been considering rebuilding my old array (the one with the hardware RAID card in it) with larger (and more) disks.\u00a0 But now there is critical data copied on that array, and I can&#8217;t scrap it and start over.\u00a0 It seemed like my only option was to build another (third) NAS server, at considerable expense.\u00a0 I could go ZFS, which requires lots of RAM (expensive), or RAID, which requires a hardware RAID card (expensive).\u00a0 I&#8217;m highly annoyed, because if I had just addressed this a few months ago when I knew I was running out of space, I would not have to build a 
third server.\u00a0 Then it occurred to me that I could buy a drive enclosure with a built-in SAS expander, and connect it to my existing server.\u00a0 That would require me to upgrade the amount of RAM (ZFS likes RAM), but it was doable.\u00a0 Of course, I would have to scrap my existing RAM, because it was ECC unbuffered, and I had maxed out what my motherboard could handle (48GB).\u00a0 I would have to purchase ECC Registered DIMMs to go beyond the 48GB barrier.\u00a0 I was telling my tale of woe to a coworker, and he mentioned that we had a bunch of servers in our warehouse that we weren&#8217;t using, and that he thought they were full of RAM.\u00a0 I checked it out, and they were indeed full of RAM.\u00a0 352GB of ECC Registered RAM, to be exact!\u00a0 So I borrowed twelve 8GB DIMMs and put them in my server.\u00a0 Voila!\u00a0 96GB!<\/p>\n<p>I ordered up my enclosure, a Norco DS-24E, and 8 Toshiba 3TB 7200RPM SATA drives.\u00a0 I figured I would start with 8 drives, and expand later on.\u00a0 The enclosure and drives arrived a few days later, and appeared to install without incident.\u00a0 That is, until I realized that all of the drives detected as 2.2TB drives.\u00a0 WTF?\u00a0 Some googling quickly revealed that the LSI SAS1068E chipsets on my SAS controllers did not support 3TB drives!\u00a0 At this point it&#8217;s been over a week since the first drive failed, and I&#8217;m on borrowed time with this array.\u00a0 After a few hours of research, I order an LSI SAS2008 PCI-E SAS HBA.\u00a0 It&#8217;s not the best, or newest, but it is known to work in the rather unusual configuration I am running (ESXi hardware passthrough to a Solaris VM, to share the array back to ESXi via NFS).\u00a0 I also ordered 8 more 3TB drives, because I realized that with only 8 drives in a vdev, I would have much less usable space, and I still needed more room to temporarily store my data.\u00a0 This is getting quite expensive!<\/p>\n<p>The new controller and drives show 
up 2 days later, and I begin surgery.\u00a0 It goes surprisingly well.\u00a0 ZFS is fantastically resilient and scalable.\u00a0 After an export\/import, the pool detected perfectly on the new controller, even though all of the drive IDs had changed.\u00a0 I was super relieved at this point.\u00a0 I then added the other 8 drives to the enclosure, and built the new pool as two 8-drive RaidZ2 vdevs.\u00a0 The pool created without incident.\u00a0 I enabled compression, NFS, and SMB sharing on the pool, and immediately began copying my data to it.\u00a0 It&#8217;s now been a little over 24 hours, and 20TB of the data has been copied to it.\u00a0 I intend to get a current copy of everything on the failing array, and scrap it completely.\u00a0 I will then rebuild it with three 8-drive RaidZ2 vdevs, just like the new array, and forgo the hot spare.\u00a0 I&#8217;ll lose a significant amount of storage (6TB), but this whole event will be much less likely to occur again.\u00a0 Any future resilvering will be limited to 8 drives, instead of 24.\u00a0 Also, my IOPS will be greatly improved, because ZFS stripes across multiple vdevs.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few weeks ago, I noticed that my ZFS array was resilvering, due to an HD failure.\u00a0 This is the first time a drive has failed in my ZFS array, which is a little over 2 years old.\u00a0 My ZFS &hellip; <a href=\"https:\/\/www.unfortunateshit.com\/?p=105\">Continue reading <span 
class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_s2mail":"yes","footnotes":""},"categories":[1],"tags":[],"class_list":["post-105","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/posts\/105","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=105"}],"version-history":[{"count":2,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/posts\/105\/revisions"}],"predecessor-version":[{"id":107,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=\/wp\/v2\/posts\/105\/revisions\/107"}],"wp:attachment":[{"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=105"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=105"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.unfortunateshit.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=105"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}