Harddisk woes

I was busy programming a few days ago when the machine froze for a few seconds, followed by an error message from the Intel Matrix RAID controller that one of the harddisks in my RAID1 (mirrored) pair had failed. Damn. This is the second time this has happened on this machine in the 2.5 years I have had it. I don’t seem to have much luck with harddisks. It might not be a coincidence that it happened on one of the hottest days of the year. I removed the defective disk, put in an identical spare I had bought for such an eventuality and rebuilt the RAID1 pair from the surviving harddisk. I felt quite pleased with myself.

A couple of days later the same error message appeared. The new disk had apparently failed. Double damn. I rebooted a couple of times. No joy. It seems unlikely that an unused disk would fail within 48 hours. Perhaps it is the RAID controller? I updated to the latest Intel Matrix RAID driver and swapped the two disks around. It still wouldn’t recognize the newly added harddisk, so it seems the new disk really is defective. I swapped the working disk with the harddisk that had failed a couple of days ago. The ‘failed’ harddisk booted OK! Something strange is going on here.

I could probably send the failed disk back to Seagate, but I am simply not prepared to risk my sensitive data to save myself £50. I tried to order another identical harddisk but, inevitably, the identical model isn’t available 2.5 years later. The disks are:

SEAGATE BARRACUDA 7200.10 ST3500630AS 500GB 7200RPM 16MB SATA-300 3.5"

Apparently the .10 is the generation number (thanks to Dennis on the ASP forums for that).

I am currently running the machine on the one good harddisk, being very conscientious about my backups. I am undecided what to do next.

  1. Order a 7200.12 disk and see what happens when I plug it in.
  2. Replace the RAID controller. I believe the Intel Matrix RAID controller is firmware on a chip on the motherboard, so replacing it doesn’t sound like much fun. And it isn’t clear that it is the cause of the problem.
  3. Buy a new PC. This one is only 2.5 years old and it means stumping up a load of cash and all the hassle of moving everything over. I would rather wait until Windows 7 is released before I consider buying a new machine (I am thinking about getting someone like overclockers.co.uk to build me a lean, mean, 64-bit, compiling machine).

Option 1 sounds like the easiest and cheapest option. Any other ideas? Is it safe to pair a 7200.10 and a 7200.12 of the same size for RAID1?

16 thoughts on “Harddisk woes”

  1. Andy Parkes

    If this were me i’d buy TWO of the 7200.12 disks

    That way you don’t have to worry about pairing with the .10 – if you have a problem in the future you’ll wonder if it was because you mixed disks

    Then i’d use disk imaging software (my preference is Symantec System Recovery) to mirror the disk that currently is working onto the new RAID array

    I’d probably then use the imaging software to schedule backups a couple of times a day onto a USB disk as an extra precaution

    Hope this helps

  2. Andy Brice Post author

    >If this were me i’d buy TWO of the 7200.12 disks

    And a spare? That is starting to get pricey!

    If I put in 2 disks, one after the other, I could use the RAID software to do the imaging.

    >I’d probably then use the imaging software to schedule backups a couple of times a day onto a USB disk as an extra precaution

    I am doing it daily to alternating USB disks.

  3. Andy Parkes

    I wasn’t looking at cost to be honest – just my opinion on the solution

    If the cost of drives is a problem, is RAID really the answer? Surely the idea is that if a drive fails you “just replace it” as they are supposedly “inexpensive” (I know in the real world especially at a small business level this isn’t quite true!)

    You could buy a .12 drive to go with the .10 you already have and one of those could fail next week (or next month or next year) and you’d be buying a new one and a spare again

    If this were me and cost of disks were a problem this is what i’d do

    Single disk system
    Spare drive
    Copy of imaging software (my preference being System Recovery)

    I’d do a base image every morning to the USB disk and then an incremental every hour (or less depending on the size of the disks)

    If the single drive fails put the spare straight in and image back to it

    Though i do appreciate you’d lose the benefit of continuous uptime in event of failure – however the fact your PC froze when a drive failed also negates actually having RAID in the first place?

    Just opinions – there is always more than one way to skin a cat :-)

  4. Justin Dolezy

    Ah, Seagate. You’d not heard of their recent problems? http://www.theinquirer.net/inquirer/news/1050374/seagate-barracudas-7200-11-failing or http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=9126280 Although you say your drives are 2.5yrs old so won’t be that issue..

    I recently had one of their 1.5TB drives fail on me, after I’d put some 300GB of data on it.. Maybe a change of manufacturer would be prudent for a while?! Personally I’ve never had any problems with Western Digital drives, you might want to try them.

    Also it’s usually a good idea to not use drives just from one manufacturer. If you’ve bought two drives at the same time and there are problems with a series then if one fails you’re just going to get stressed about when the other will fail!

  5. Andy Brice Post author

    Justin,

    I have had WD, Seagate and Maxtor drives fail on me in the past.

    I hadn’t heard about the recent Seagate firmware problems. On that basis I think I might avoid Seagate drives this time.

  6. Andy Brice Post author

    I have fitted a 500GB/7200 RPM/SATA Hitachi DeskStar as the second disk. I hope it lasts longer than the last Seagate.

  7. Thomas

    Andy, does the device where the disks are mounted measure the temperature? If so, it might be worth installing software that notifies you when it’s getting too warm. For PCs I know that this is possible (already got that warning today ;-).

  8. Andy Brice Post author

    The disks were both hot. I would estimate about 40-50C. I believe that is fairly normal. I am not aware there is any sort of temperature sensor (it is a Dell Dimension 9200), unless there is one inside the harddisk?

    1. Justin Dolezy

      Andy, I’m pretty sure most hard drives these days have SMART built in [http://en.wikipedia.org/wiki/S.M.A.R.T.] Some motherboards can monitor the disks – you’ll need to have a tinker in the BIOS. There’s probably lots of apps that can report the info if present, SpeedFan is one I’ve used in the past [http://www.almico.com/speedfan.php]. HTH

  9. Andy Brice Post author

    Thanks for the tip Justin. I installed Speedfan v4.38. Unfortunately it doesn’t seem to see either of the HardDisks.

  10. Jim

    Well, I didn’t read your advice about not working on a backup product until today, so here’s a link to a new Linux backup product:

    http://sites.google.com/site/hashbackup

    Keeps multiple versions, easy retention policies (last 7 days + last 3 months for example), can mount the backup as a filesystem, can send data offsite via FTP, S3, ssh, or IMAP.

    Give it a try before your hard drive dies!

  11. Eric

    I know it’s a bit of an old post now, but I thought I’d share my thoughts anyway.

    Thanks for posting the drive model number, it’s important:

    SEAGATE BARRACUDA 7200.10 ST3500630AS 500GB 7200RPM 16MB SATA-300 3.5″

    Notice the “AS” designation, i.e. archival storage. It’s a consumer-grade desktop drive. For use with RAID, you want the “NS” model drives.

    As with most things, there are different levels of build quality and features, reliability, etc. with different models of drives.

    If you’re using RAID, you want a drive that’s intended for use with RAID; even if it’s “host-based” or software RAID, and not hardware RAID. There is a real difference. And I suspect this might be related to the troubles you’ve had.

    Consumer-class drives not intended for use in RAID systems can actually induce failures, i.e. you might not have experienced the problems if you didn’t use the drive in a RAID configuration to begin with.

    Check out Seagate’s and Western Digital’s “enterprise-class” drives for models intended for use with RAID; for Seagate, look for drives that end with “NS”, nearline storage (Barracuda ES.2 Hard Drives); for Western Digital, look for drives in the “RE3” and “RE4” line.

    The nice thing about these drives, and SATA in general, is they’re not that much more expensive. I just bought a 250GB Seagate NS drive (that has the “SN06” firmware :-) for use with my hardware RAID controller for $70. I bought a matching Western Digital RE3 drive to use with it in a RAID1 mirror for about the same price.

    Yes, all manufacturers have build issues with different lots and issues with QA. I decided to hedge my bets and buy two similar, enterprise-class drives from different manufacturers; and it didn’t cost much more than buying the cheaper, consumer-grade desktop drives.

    Hope this helps, -Eric

  12. Andy Brice Post author

    Eric,

    That is interesting to know. I wish Dell had told me this when I bought a PC with RAID.

    Why are the demands greater on the drives in a RAID1 array? Is it because they are doing additional checking?

  13. Jim

    I think the main issue is with vibration: when you have multiple drives in a chassis, and drive A does a seek, drive B doing a seek can vibrate the chassis enough that it knocks drive A off cylinder and either the read fails (not so bad – it can be retried), or the write is a little bit off cylinder. The write being “off” is bad, because when you try to read that record and get accurately positioned on cylinder, you increase the chance of getting a read error.

    The “RAID certified” drives have extra electronics to monitor head movement after the seek is complete and compensate for vibrational forces if necessary during the read/write phase. I’ve read that these drives undergo more thorough defect testing, which is likely true since it would take more time during manufacturing and therefore should cost more. I’ve also read that the Western Digital “Black” drives are the same as the RE3/4 drives, except that they haven’t gone through the extra testing and are therefore cheaper. The Black drives aren’t advertised as having the RAID positioning features, so it’s not clear whether they are really the same as the RE3/4.

    While the vendors advertise this as a RAID thing, I think the same reliability issues apply anytime you put multiple drives in a chassis and expect to be accessing them concurrently.

    Jim

    HashBackup LLC

  14. Eric

    Yes, that’s really disappointing that Dell included those drives. They should know better.

    Sorry, I didn’t save any of the links; I searched for awhile and couldn’t find any good ones that specifically answer your question.

    I think Jim explained it really well; vibration tolerance is a big one in multi-drive setups. Error-correction/detection and build quality/components are other areas, along with more testing/QA.

    Another difference is how well-implemented, or not, the drive software/firmware is, and its compatibility with the SATA spec. and the commands sent to/from the controller/drive. These are the differences that make or break compatibility with RAID configurations, especially with hardware controllers. The manufacturers test with certain drives and then certify/publish certain drives as compatible.

    Here’s a link to the Seagate ES.2 line:

    ds_barracuda_es_2.pdf

    Western Digital RE3 page:
    http://www.wdc.com/en/products/Products.asp?DriveID=503
    “RAID-specific, time-limited error recovery (TLER) – Prevents drive fallout caused by the extended hard drive error-recovery processes common to desktop drives.”

    -Eric

  15. Pingback: Speccing my dream development PC « Successful Software
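
Following up on the SMART suggestion in the comments above: one way to keep an eye on drive temperature is to read the SMART attribute table and look for the temperature row. The sketch below is only an illustration under stated assumptions. It assumes the smartmontools `smartctl` command is installed, that `smartctl -A` prints the usual ten-column attribute table, and that the drive reports temperature under attribute ID 194 (many drives do, but not all). The 50C warning limit, the sample output and the function names are made up for the example. Drives sitting behind a RAID controller may not expose SMART data at all, which could be why SpeedFan could not see the disks here.

```python
import subprocess

# Assumption: the drive reports temperature under SMART attribute ID 194.
TEMPERATURE_ATTRIBUTE_ID = "194"

# Illustrative sample of `smartctl -A` output (not taken from a real drive).
SAMPLE_OUTPUT = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   078   078   000    Old_age   Always       -       19876
194 Temperature_Celsius     0x0022   045   055   000    Old_age   Always       -       45 (0 18 0 0 0)
"""


def parse_temperature(smartctl_output):
    """Return the temperature in Celsius found in `smartctl -A` text, or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns; the 10th starts the raw value.
        if len(fields) >= 10 and fields[0] == TEMPERATURE_ATTRIBUTE_ID:
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


def is_too_hot(temperature_c, limit_c=50):
    """True if a known temperature exceeds the (hypothetical) warning limit."""
    return temperature_c is not None and temperature_c > limit_c


def read_temperature(device):
    """Run smartctl on a device path and parse the result; None on any failure."""
    try:
        result = subprocess.run(
            ["smartctl", "-A", device],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        return None  # smartctl is not installed
    return parse_temperature(result.stdout)


if __name__ == "__main__":
    temperature = parse_temperature(SAMPLE_OUTPUT)
    print(temperature, is_too_hot(temperature))
```

To check a real drive you would call `read_temperature` with a device path (for example `/dev/sda` on Linux) from a scheduled task and send yourself a warning when `is_too_hot` returns True.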
