There comes a time in every system administrator's life when the dreaded hard disk failure occurs. Unfortunately, I just lived this nightmare, but the outcome was not as bad as it could have been.
The Scene
It was a Tuesday after a 3-day weekend--a Monday-wanna-be. A main web server had just died and refused to boot. When I tried to mount the RAID 1 root partition, a never-ending stream of unrecoverable hard disk errors and invalid opcode errors filled the screen.
Attempt 1
Not a big deal, I thought to myself. I'll just shut down, disconnect the bad drive, and let the mirrored partition take over. Success was on the horizon--the computer POST'd, the kernel loaded, the drives were mounted, services began to start, and finally a login prompt! Unfortunately, the hostname was completely wrong. This "mirrored" partition, although perfectly sized to mirror the bad root partition, conveniently held a completely separate Red Hat installation. Strike 1.
Attempt 2
Next, I figured I'd just repair the damaged partition with e2fsck. Unfortunately, the hard disk errors simply filled the screen. Strike 2.
Attempt 3
Now things were looking bleak. The RAID 1 was really RAID nonexistent. The failing hard drive didn't cooperate with basic mounting and partition tools. There were backups of critical data, but some were 3 weeks old. In addition, the outage of several major web applications and the work needed to rebuild from scratch made the situation even worse.
My last resort was to get raw with this disk. It was time to bring in every *nix newbie's dreaded tool: dd, AKA delete disk or destroy data. dd is just a simple byte copying utility that can be used in powerful ways. My goal was to copy the damaged disk byte by byte while ignoring any read errors. A simple test of dd if=/dev/hdb3 of=testfile.out bs=1024 count=512 successfully read the first 512 KB of the partition, boot sector included, into the output file, so I knew there was some hope. Fortunately, there are tools that work much like dd but include features to make data recovery more convenient.
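For readers who have not used dd this way, here is a minimal sketch of a copy that keeps going past read errors. It is not the exact command from that day: it runs against a scratch image file so it is safe to try, and the names SRC and DST are made up for illustration. On a real system SRC would be the damaged partition (such as /dev/hdb3) and DST a file or partition on a healthy disk.

```shell
#!/bin/sh
# Sketch only: practice on a scratch image, never on the real disk first.
# SRC and DST are hypothetical names, not paths from the original recovery.
set -eu
WORK=$(mktemp -d)          # scratch directory; remove it by hand afterwards
SRC="$WORK/damaged.img"    # stand-in for the failing partition
DST="$WORK/copy.img"       # stand-in for the destination on a good disk

# Build a 1 MiB stand-in for the damaged partition.
dd if=/dev/urandom of="$SRC" bs=1024 count=1024 2>/dev/null

# conv=noerror tells dd to continue after a read error.
# conv=sync pads each short or failed read with zeros, so the data that
# follows stays at the same offset in the copy.
# A small block size means less good data is zeroed out around each bad sector.
dd if="$SRC" of="$DST" bs=512 conv=noerror,sync 2>/dev/null

# On a healthy source the copy should be byte-for-byte identical.
cmp "$SRC" "$DST" && echo "copy matches source"
```

The trade-off is speed: a 512-byte block size is slow on a large partition, which is one reason the purpose-built rescue tools are more pleasant to use.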
The tool that saved my day was ddrescue by Antonio Diaz Diaz. The goal was to copy raw bytes from the damaged partition over to the perfectly sized "mirrored" partition. ddrescue took about 6 hours but produced a partition on the good disk that was mountable. After perusing the contents, I was confident the recovery went well. The statistics at the end of the recovery showed 145 GB recovered and 1 MB lost. Although this sounds fantastic, keep in mind that this 1 MB touched many files and directories. Losing even a few bytes of file system structure can make critical data files unrecoverable.
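A typical GNU ddrescue session looks roughly like the sketch below. The options and file names are assumptions for illustration, since the exact command line used is not recorded here. The two-pass pattern (copy the easy areas first, then retry the bad ones) and the mapfile that lets an interrupted run resume are standard ddrescue features. The example works on scratch files and skips itself if ddrescue is not installed.

```shell
#!/bin/sh
# Sketch only, run against scratch files. SRC, DST and MAP are hypothetical
# names; the options shown are a common pattern, not the ones used that day.
set -eu
WORK=$(mktemp -d)          # scratch directory; remove it by hand afterwards
SRC="$WORK/damaged.img"    # stand-in for the failing partition
DST="$WORK/rescued.img"    # stand-in for the partition on the good disk
MAP="$WORK/rescue.map"     # mapfile: tracks which areas have been read

# Build a 1 MiB stand-in for the damaged partition.
dd if=/dev/urandom of="$SRC" bs=1024 count=1024 2>/dev/null

if command -v ddrescue >/dev/null 2>&1; then
    # Pass 1: copy everything that reads cleanly and skip the slow scraping phase.
    ddrescue -n "$SRC" "$DST" "$MAP"
    # Pass 2: return to the bad areas and retry them up to three times.
    ddrescue -r3 "$SRC" "$DST" "$MAP"
    cmp "$SRC" "$DST" && echo "rescued copy matches source"
else
    echo "ddrescue is not installed; nothing copied"
fi
```

When the destination is a real partition rather than a file, ddrescue also needs -f to confirm that overwriting it is intended, so triple-check which device is the source and which is the destination before pressing Enter.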
The Home Stretch
The next step was to disconnect the bad drive and attempt a boot with the restored partition. Upon boot, I was welcomed with the typical root password prompt forcing me to manually run fsck to check the file system. I knew the errors and corrections made to the file system during this step would directly affect how successful the recovery was. I must have said 'yes' to over 2000 errors affecting hundreds of files. Although many were important, I did not see anything related to critical databases on the server. Upon completing fsck, I rebooted once again to finally see the correct login prompt. After logging in, I found that all databases were intact and most of the websites running Apache and Tomcat came back online. The only applications truly lost were default installations of two applications that can be reinstalled easily since the databases survived.
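Answering 'yes' a couple of thousand times by hand is not the only option. The sketch below shows the e2fsck flags for a read-only dry run (-n) and for answering yes to every question automatically (-y). It assumes an ext2/ext3 file system and the e2fsprogs tools, and it runs against a freshly made scratch image (the name IMG is hypothetical) rather than a real partition, so it will report a clean file system.

```shell
#!/bin/sh
# Sketch only: exercises e2fsck on a scratch image, not on a real partition.
# IMG is a hypothetical name; on a real system it would be the restored
# partition, checked while unmounted.
set -eu
PATH="$PATH:/sbin:/usr/sbin"   # e2fsprogs often lives outside a normal user's PATH
WORK=$(mktemp -d)              # scratch directory; remove it by hand afterwards
IMG="$WORK/rootfs.img"

# 8 MiB of zeros to hold a small test file system.
dd if=/dev/zero of="$IMG" bs=1024 count=8192 2>/dev/null

if command -v mke2fs >/dev/null 2>&1 && command -v e2fsck >/dev/null 2>&1; then
    mke2fs -q -F "$IMG"        # -F: allow a regular file instead of a block device

    # Dry run: -f forces a full check, -n opens the file system read-only
    # and answers "no" to every question, so nothing is changed.
    e2fsck -f -n "$IMG" && echo "dry run: clean" || echo "dry run: repairs would be needed"

    # Real run: -y answers "yes" to every question.
    # Exit status 0 = no errors, 1 or 2 = errors corrected, 4 and up = problems remain.
    RC=0
    e2fsck -f -y "$IMG" || RC=$?
    echo "e2fsck exit status: $RC"
else
    RC=skipped
    echo "e2fsprogs is not installed; nothing checked"
fi
```

Running the -n pass first and saving its output gives a list of affected files to review before any repair is committed, which is useful when deciding what to restore from backup afterwards.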
Overall, this was a success story, and I thank the author of ddrescue for helping me conquer this wanna-be Monday. In addition, there are two shiny new servers with real hardware RAID, redundant fans, and redundant power supplies soon to be on the way to replace the aging server hardware.
Comments
Hard disk disasters are just my "favorites". The last time I dealt with one, I ended up at a computer service shop for data recovery because I had lost important files on my hard drive. I had a few nightmarish hours, but luckily the story had a happy ending. Since then it's been really difficult to trust hard disks like I used to...
A friend of mine who owns a company had to recover the data on a RAID, and a staff member suggested that he could do it himself with the help of recovery software. But he was not willing to take the risk because the data was too important to him. After a bit of investigation we came across http://www.diskdoctors.com/raid-recovery.asp , which was Disk Doctors RAID Recovery, and it helped him recover all his data.
If you have extremely critical information on a crashed hard disk, then professional recovery services are definitely the way to go. In our situation, we had backups that were about 3 weeks old. There would have been significant data loss, but nothing that couldn't be considered acceptable loss or regenerated by hand or from paper.

Also, the recovery method I used did not attempt to write to the damaged hard disk at all. The dd command simply reads whatever bytes it can and copies them to another disk. The only risk would be if the disk went abnormally haywire and started writing garbage to itself. However, this is extremely unlikely. This is different from copying data from a mounted partition, which could cause writes to occur on the disk as file access times are updated. That method on a damaged disk is definitely more risky.