
I’m starting to get a collection of computers at home, and to support them I have my “server” Linux box running a RAID array. It’s currently mdadm RAID 1, going to RAID 5 once I have more drives (and then RAID 6, I’m hoping). However, I’ve heard various stories about data getting corrupted on one drive and you never noticing because the other drive is being used, up until the point when the first drive fails and you find your second drive is also screwed (and the 3rd, 4th, 5th drive). Obviously backups are important and I’m taking care of that too, but I know I’ve previously seen scripts that claim to help against this problem and let you check your RAID while it’s running. Looking for these scripts again now, I’m finding it hard to find anything similar to what I ran before, and I feel I’m out of date and not understanding whatever has changed. How would you check a running RAID to make sure all disks are still performing normally? I monitor SMART on all the drives and also have mdadm set to email me in case of failure, but I’d like to know my drives occasionally “check” themselves too. Thanks in advance.
The point of RAID with redundancy is that it will keep going as long as it can, but obviously it will detect errors that put it into a degraded mode, such as a failing disk. You can show the current status of an array with
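As a minimal sketch of such a status check: `mdadm --detail` prints the state of one array, and with `--test` its exit status reflects the array's health (values as listed in the mdadm man page). The device name `/dev/md0` and the helper names `md_status_text` and `check_array` are assumptions made for illustration.

```shell
# Map the exit status of `mdadm --detail --test` to a short description
# (values as documented in the mdadm man page).
md_status_text() {
    case "$1" in
        0) echo "ok" ;;
        1) echo "degraded: at least one failed device" ;;
        2) echo "unusable: multiple failed devices" ;;
        4) echo "error: could not get information about the device" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Hypothetical helper: report the state of one array, e.g. /dev/md0.
# Needs root and an existing md device.
check_array() {
    rc=0
    mdadm --detail --test "$1" > /dev/null 2>&1 || rc=$?
    printf '%s: %s\n' "$1" "$(md_status_text "$rc")"
    return "$rc"
}
```

Running `check_array /dev/md0` as root prints one line and returns the same status, so it can be used directly in a script's `if` or a cron job.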
Furthermore, the return status of
You can also get a quick summary of all RAID device status by looking at
In addition to these spot checks, mdadm can notify you as soon as something bad happens. Make sure that you have
Make sure that you do receive mail sent to root on the local machine (some modern distributions omit this, because they assume that all email goes through external providers, but receiving local mail is necessary for any serious system administrator). Test this by sending root a mail:
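The notification setup can be sketched roughly as follows. The config path `/etc/mdadm/mdadm.conf` (Debian-style; other distributions often use `/etc/mdadm.conf`), the helper name `mdadm_mailaddr`, and the presence of a `mail` (mailx) command are all assumptions; `MAILADDR` is the mdadm.conf keyword that sets the alert recipient.

```shell
# Print the address on the MAILADDR line of an mdadm config file.
# Returns non-zero if the file has no MAILADDR line.
# The default path is an assumption (Debian-style layout).
mdadm_mailaddr() {
    conf=${1:-/etc/mdadm/mdadm.conf}
    awk '$1 == "MAILADDR" { print $2; found = 1 } END { exit !found }' "$conf"
}

# Send a test message to root through the local mail system
# (assumes a mail/mailx command is installed):
#   echo "test body" | mail -s "local mail test" root
#
# Ask mdadm to send a test alert for every array it finds, then exit:
#   mdadm --monitor --scan --oneshot --test
```

If `mdadm_mailaddr` prints nothing and fails, no recipient is configured in that file, so alerts from the monitor would have nowhere to go.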
You can force a check of the entire array while it’s online. For example, to check the array on
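One way to trigger such an online check is through the md driver's sysfs interface: writing `check` to an array's `sync_action` file starts a read-only scrub, and `mismatch_cnt` afterwards holds the number of inconsistencies found. A minimal sketch, where the array name `md0` and the helper names are assumptions, and the optional second argument exists only so the functions can be tried against a scratch directory:

```shell
# start_md_check NAME [SYSFS_ROOT]
# Ask the kernel to start a consistency check of array NAME (e.g. md0).
# Needs root when run against the real /sys/block.
start_md_check() {
    echo check > "${2:-/sys/block}/$1/md/sync_action"
}

# md_mismatches NAME [SYSFS_ROOT]
# Print the mismatch count recorded by the most recent check.
md_mismatches() {
    cat "${2:-/sys/block}/$1/md/mismatch_cnt"
}
```

Usage as root would be `start_md_check md0`, then, once the check has finished, `md_mismatches md0`. Some distributions already ship a periodic job that does this, so it is worth checking before adding a second one.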
I also have a cron job that runs the following command once a month:
It’s not a thorough check of the drive itself, but it does force the system to periodically verify that (almost) every file can be read successfully off the disk. Yes, some files are going to be read out of memory cache instead of disk. But I figure that if the file is in memory cache, then it’s successfully been read off disk recently, or is about to be written to disk, and either of those operations will also uncover drive errors. Anyway, running this job tests the most important criterion of a RAID array (“Can I successfully read my data?”) and in the three years I’ve been running my array, the one time I had a drive go bad, it was this command that discovered it. One little warning is that if your filesystem is big, then this command is going to take a long time; my system takes about 6hr/TiB. I run it using
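A read-everything pass of this kind can be sketched with `find` and `cat`. This is a swapped-in illustration, not necessarily the command the author used; the function name, script path, mount point, and schedule below are all hypothetical.

```shell
# read_all_files DIR
# Read every regular file below DIR (staying on one filesystem) and discard
# the data. cat reports unreadable files on stderr, and the function returns
# non-zero if find or any cat invocation failed.
read_all_files() {
    find "$1" -xdev -type f -exec cat {} + > /dev/null
}

# Example monthly cron entry (hypothetical script path, mount point and
# schedule), run in the idle I/O class so normal use is not slowed down:
#   0 3 1 * * root ionice -c3 /usr/local/sbin/read-all-files /srv/raid
```

Because errors go to stderr, cron will mail them to the job's owner, which ties in with the local mail setup discussed earlier.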
I use this simple function to check
    # Health of RAID array
    raid() { awk '/^md/ {printf "%s: ", $1}; /blocks/ {print $NF}'
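A runnable variant of this function might look like the sketch below. It assumes the input is `/proc/mdstat`, since the awk patterns match that file's layout, and takes an optional file argument so it can be tried on a saved copy. `raid_ok` is a hypothetical addition that fails when any array shows a missing member (`_`) in its status field.

```shell
# Health of RAID array: print "mdX: [UU]"-style lines.
# The default input file is an assumption.
raid() {
    awk '/^md/ {printf "%s: ", $1}; /blocks/ {print $NF}' "${1:-/proc/mdstat}"
}

# Succeed only if no array reports a missing member ("_" inside the brackets).
raid_ok() {
    if raid "$@" | grep -q '\[[U_]*_[U_]*\]$'; then
        return 1
    fi
    return 0
}
```

For example, `raid_ok || echo "RAID degraded"` could go in a shell profile or a cron job.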