I got 4 x 10TB HDDs -- HGST Ultrastar He10.
One of my RAID1 drives (the important ones) started sending me error emails about 8 sectors being "unreadable".
So first I waited a bit to see whether it would resolve itself. The errors did go silent for a few months, then came back...
At that point, I decided to purchase a replacement and retire the drive that was giving me errors. The new drive looks good so far!
To see what the SMART system discovered, you can use the smartctl command like so:
smartctl -a /dev/sdg1
The output is pretty long. The specific field I would look for is the:
Current_Pending_Sector
The number of pending sectors is the same as the number I would get in the emails.
To know whether the hard drive may need to go to the trash, check the:
Reallocated_Event_Count
If that second number is really high, then there are issues on your drive and the SMART system is trying to save data to new blocks. It may just be a small area of the drive that is bust (it could even be a manufacturing issue, although drives are low-level formatted and verified before being sold, so it's unlikely that you would get such a drive).
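Independently of the individual attributes, smartctl can also give you the drive's overall self-assessment with the -H option (a quick pass/fail verdict, not a replacement for reading the attributes):

smartctl -H /dev/sdg1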
The Number of Hours used can also be of interest:
Power_On_Hours
There are about 8,766 hours per year. An HDD can live for about 10 years, so once you reach around 87,000 hours, you may want to consider changing the drive even if it still works just fine... I don't know as much about SSDs; they likely have a similar lifetime, however the number of writes is what determines the life of an SSD, whereas an HDD wears with both reads & writes and simply with the number of hours it is powered on (especially if the motor runs nearly permanently).
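As a quick sanity check, you can convert the raw Power_On_Hours value to years right on the command line. This assumes the usual smartctl attribute table layout where the raw value is the last (10th) column; adjust if your drive reports it differently:

smartctl -A /dev/sdg1 | awk '/Power_On_Hours/ { printf "%.1f years\n", $10 / 8766 }'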
Since I collect a lot of data with a few tools (spider-like, if you wish), I can still make use of a 10TB drive like this one, because if it ends up failing completely, I won't be losing anything of any remote importance to me.
But the problem is that the SMART monitoring would continue to send me one email a day about those 8 sectors (on a 10TB drive, 8 sectors is really nothing, and the count has not grown at all since these errors started!)
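For reference, those daily emails typically come from the smartd daemon. A line along these lines in smartd.conf (the path and exact directives depend on your distribution, so treat this as an assumption about a typical setup) is what produces one reminder per day:

DEVICESCAN -a -m root -M daily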
So I decided to search for how to fix the error. That was complicated. No one had a good idea. I've seen all sorts of commands to do it, but none that I thought would make sense on such a large hard drive. The main problem, from my point of view, is that these 8 sectors were not labeled anywhere. That is, SMART knows errors occurred, it just doesn't know where. That means you've got to check the entire drive, probably in read/write mode. There is a test you can run for that purpose and that's what I've done:
sudo fsck -C -f -c -c -y /dev/sdg1
This command (which is definitely not the default fsck command line!) goes through every single block of the specified partition (/dev/sdg1 in my example), reads each block, and then writes the data back to it. The -C option shows progress, -f forces a check even if the filesystem looks clean, and -y answers yes to all questions. Obviously, for a partition of some 9.7TB plus all the inodes, etc., that means quite a bit of time spent running. For my hard drive, this took about 40 hours.
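One practical note: the filesystem must not be mounted while this runs, and since the run can take a day or two, you may want to start it under nohup (or screen/tmux) so a dropped SSH session doesn't kill it. A possible sketch, with a hypothetical log file name:

sudo nohup fsck -f -c -c -y /dev/sdg1 > fsck-sdg1.log 2>&1 &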
The -c option is what asks fsck to run the badblocks utility to verify all the sectors. If you think the bad blocks could be outside of the partition, then you've got yet another problem and fsck won't suffice, but in my case that worked.
A single -c runs a read-only test, while specifying -c twice runs a non-destructive read-write test. Using both was important: I first tried with a single -c, which only took 20 hours, but it fixed nothing at all.
So I think that what fixed the sectors was the write pass.
The good news for me is that in the end the disk did not reallocate the sectors anywhere else. That means the drive is probably just fine.
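You can confirm that by re-checking the two attributes after the run; if the write pass did the trick, Current_Pending_Sector should be back to 0 and Reallocated_Event_Count should not have increased:

smartctl -A /dev/sdg1 | grep -E 'Current_Pending_Sector|Reallocated_Event_Count'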
I use a software RAID1 with an md0 device (i.e. the partitions are of type Linux RAID).
To add/remove drives, you use the mdadm command.
Note:
Between mdadm commands, feel free to use the following to verify that what you were trying to do happened:
cat /proc/mdstat
You can quickly partition the new drive using the following on your command line:
sfdisk -d /dev/sdf | sfdisk /dev/sdd
The -d option asks sfdisk to dump the partition info of /dev/sdf to stdout. The second sfdisk reads that data from stdin and creates the exact same partition table on /dev/sdd.
Note that for the md0 device to work, all drives must have the exact same partition(s), hence the command.
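If you prefer a two-step approach, you can dump the table to a file first, which also gives you a backup of the layout (the file name here is just an example):

sfdisk -d /dev/sdf > sdf-partition-table.txt
sfdisk /dev/sdd < sdf-partition-table.txt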
Personally, though, I prefer to use fdisk and be in total control of what is happening. There I would first open /dev/sdf, use the p command to print out the partition table. Then I would open /dev/sdd and replicate the partition from the first drive, most likely using the n command.
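A rough sketch of that manual route (the interactive part is up to you: n to create the partition, t to set the type to Linux RAID, and w to write the changes):

sudo fdisk -l /dev/sdf
sudo fdisk /dev/sdd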
Unless you get two drives from the exact same model and batch, it is quite likely that the number of sectors will be slightly different. The one problem is if your newer drive is a bit smaller; then you've got a real problem. If it is a little bigger, it is not an issue: a few sectors will simply be lost.
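To compare the raw sizes before partitioning, blockdev will tell you right away whether the new drive is big enough (it prints the size in bytes):

sudo blockdev --getsize64 /dev/sdf
sudo blockdev --getsize64 /dev/sdd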
If you had a drive and it was automatically removed by the OS (which happened to me once) then you can add the new drive to the RAID using:
mdadm --manage /dev/md0 --add /dev/sdd1
Just make sure to use the correct device in /dev/sddX.
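Once added, the new drive shows up in the array and the rebuild starts right away; either of these will confirm it:

cat /proc/mdstat
sudo mdadm --detail /dev/md0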
If the old drive still appears in the RAID setup then you need to remove it. To do so you need to use two steps: (1) mark it as failed, (2) remove it:
mdadm --manage /dev/md0 --fail /dev/sdd1
mdadm --manage /dev/md0 --remove /dev/sdd1
Note that the --remove command fails if you don't first mark the drive as failed. This is because just removing an active drive would break the RAID, and the mdadm command tries to prevent you from making that mistake.
In one case, I did not want to remove the drive that was "breaking down" right away, since it seemed to be mostly working just fine. I had never seen I/O errors on it up to that point, other than those 8 sectors reported by the SMART system.
So at first, I added my new drive:
sudo mdadm --grow /dev/md0 --raid-devices=3 --add /dev/sdh1
The new drive needs to be partitioned first with a Linux RAID partition, as described above.
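If your version of mdadm refuses the --add in --grow mode, the same result can be obtained in two steps (an equivalent alternative, not what I ran):

sudo mdadm /dev/md0 --add /dev/sdh1
sudo mdadm --grow /dev/md0 --raid-devices=3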
Then I waited for the data to be duplicated, which took a few days. You can follow the progress by looking at the status in /proc. It will look like the following once done:
$ cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdg1[1] sdf1[0] sdd1[3]
      9766302720 blocks super 1.2 [3/3] [UUU]
      bitmap: 24/73 pages [96KB], 65536KB chunk
Before it's done, there is an extra line showing you the percent completion.
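To keep an eye on that progress line without retyping the command, watch works nicely (here refreshing every 60 seconds):

watch -n 60 cat /proc/mdstat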
Now we want to remove the failing drive and reduce the number of devices so mdadm doesn't complain about a missing drive:
sudo mdadm /dev/md0 --fail /dev/sdg1
sudo mdadm /dev/md0 --remove /dev/sdg1
sudo mdadm --grow /dev/md0 --raid-devices=2
Notice how we use --grow to reduce the number of devices in the array.
We are required to mark the drive as failing (--fail) before we can remove it. This is a safety measure so you don't end up removing drives that are still viewed as working in your array.
Now you can check /proc/mdstat again and see that your array is back to just 2 drives.
Yes, you can absolutely use a larger drive. However, when you partition that new drive, make sure the partition is the same size as on the existing array drives.
The one thing you must keep in mind in this case is that the extra space will be "lost". That is, you can't use it without disturbing the proper functioning of that drive within your array.
For example, imagine that you create a separate partition in the extra space and mount it on /tmp. Now the drive head is going to move to that new partition each time something is read from or written to /tmp, and that means the RAID speed is impacted dearly.
So it's not that you can't use the extra space, but you will lose speed if you do. If you really need more space, you probably want to get yet another drive and use it separately. You could easily get 16TB in one HDD as of 2021, so saving just 1 or 2TB on a RAID drive would not be that useful. If you're really tight money-wise, it's still a solution for you, but again, the impact on your RAID is going to be really high.