(side note: I decided to try actually using my blog for stuff like this; expect more.)
Background: I have a large home media server, previously housed in a Norco 4U rackmount case; in the interests of being able to move it, I rebuilt it a while ago into an NZXT H2 tower case. I was very pleased with the outcome; the machine was reasonably compact, extremely quiet, and housed 14 drives with no problem. All of the SATA cables were purchased as close to the right length as possible, and I custom-made all of the drive power cables to eliminate clutter and maximize airflow.
When it came time to move the whole thing up to Seattle, I had the drives packed separately from the case, but both sets of things were damaged. The case itself got dented by the power supply, but is otherwise fine. One of the drives sounds like it had a head crash, and another one was banged around enough that part of its circuit board was smashed up. Replacing the only visibly smashed component (an SMT inductor) on the board didn’t fix things up.
Other background: I was running unRAID on the server, a commercial Linux distribution designed for home media servers. It uses a modified form of RAID-4: there’s a dedicated parity drive, but the filesystems on the data drives aren’t striped. This means write performance is about 25% of a single drive’s throughput, but it can spin down drives that aren’t in use. It also means that, while it has single-drive redundancy like RAID-4 or RAID-5, losing two drives doesn’t mean you lose everything; you lose (at most) two drives’ worth.
Well, I wasn’t interested in losing two drives’ worth of stuff. The head-crash drive (a 3TB Seagate) was clearly a lost cause; at best I’d be able to use it for spare parts for fixing other drives of the same model in the future. The smashed drive, however, had hope. I had another of the same model (a Samsung 2TB), and swapping the circuit board between them meant the smashed drive was about 80% readable. (This trick normally requires swapping the drive’s 8-pin BIOS chip over as well, but Samsung drives are more forgiving.)
So, I grabbed whatever spare drives I could, and set about cloning what I could of the 2TB drive. I used ddrescue for this, which is great: it copies whatever data it can, with whatever retry settings you give it, and keeps a log of what it’s accomplished, so it can resume, retry later, or retry from a clone (great for optical media). I used it to clone what could be read off of the Samsung drive onto a replacement, and then used its “fill” mode to write “BADSECTOR” over every part of the replacement drive that hadn’t been copied successfully. I then brought up the system in maintenance mode, with the 2TB clone standing in for the smashed drive and a blank 3TB replacement for the head-crash drive. I had to recreate the array settings (unRAID won’t let you replace two drives at once), and then let the system rebuild the 3TB drive from parity. (Mid-process, one of the other drives threw a few bad sectors. I used ddrescue to copy that disk to /dev/null, keeping the log of the bad sectors, and then used fill mode to write “BADSECTOR” over the failed sections, forcing the drive to reallocate them.)
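For reference, the shape of those ddrescue invocations was roughly this; the device and log-file names here are placeholders, the exact flags depend on your ddrescue version, and the fill-mode trick is essentially the one described in ddrescue’s own manual:

```sh
# Clone as much as possible of the failing 2TB drive onto its replacement,
# with a few retries over bad areas, logging what was and wasn't recovered:
ddrescue -f -r3 /dev/sdb /dev/sdc samsung.log

# Fill mode: overwrite every region of the clone that the log marks as
# bad ("-") with a recognizable marker string:
ddrescue --fill-mode=- -f <(printf "BADSECTOR ") /dev/sdc samsung.log

# For the drive that threw bad sectors mid-rebuild: scan it (copying to
# /dev/null just to build a log), then fill the bad spots in place so the
# drive reallocates those sectors:
ddrescue -f -n /dev/sdd /dev/null scan.log
ddrescue --fill-mode=- -f <(printf "BADSECTOR ") /dev/sdd scan.log
```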
Once the 3TB drive was rebuilt, I used the ddrescue log files to write “BADSECTOR” on the just-rebuilt drive as well, because areas that were rebuilt from failed sectors on the other drives weren’t to be trusted. (This involved scripting some sector math, since the partition offsets of the drives weren’t the same, and unRAID calculates parity across partitions, not whole drives.) After that, I fsck’d the three drives involved, and then grepped through all the files on all of them looking for BADSECTOR, thereby identifying which files could no longer be trusted.
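Sketched out, that last stretch looked something like the following; the device name and mount points are stand-ins, and the actual sector math was just a few lines of offset arithmetic in a throwaway script:

```sh
# The bad regions in the Samsung log are byte offsets from the start of
# that drive; the matching regions on the rebuilt 3TB drive sit at
#
#   offset_on_rebuilt = (offset_on_failed - failed_partition_start)
#                       + rebuilt_partition_start
#
# because parity is computed across the partitions, not the raw drives.
# After shifting the log entries by that delta (shifted.log below), mark
# the suspect regions on the rebuilt drive, then find the files that
# contain the marker:
ddrescue --fill-mode=- -f <(printf "BADSECTOR ") /dev/sde shifted.log
grep -rl "BADSECTOR" /mnt/disk1 /mnt/disk2 /mnt/disk3
```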
This didn’t cover files that were just outright missing; I didn’t have a complete list of files, but for the video files at least, I was able to determine what was missing by loading up the SQLite database used by Plex Media Server, which had all of them indexed.
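Something along these lines pulls out the list of paths Plex knows about, for comparison against what actually survived; the database filename is right, but the table and column names here are from memory and may differ between Plex versions:

```sh
# Plex's library database is an SQLite file living under its
# "Plug-in Support/Databases" directory.
sqlite3 com.plexapp.plugins.library.db \
  "SELECT file FROM media_parts;" | sort > plex_files.txt

# Report anything Plex had indexed that no longer exists on disk:
while IFS= read -r f; do
  [ -e "$f" ] || echo "MISSING: $f"
done < plex_files.txt
```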
In the end, everything was working again, with the lost data reduced to about 10% of what it would have been. It did get me thinking about changing out the server software, though; that’s another post.