Whenever a disk starts failing, you get errors telling you that an ATA command is timing out or returning some other kind of error.
Here is an example of errors I was getting in /var/log/syslog with Ubuntu 22.04:
Feb 22 07:47:13 monster kernel: [87309.624764] ata6.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x0
Feb 22 07:47:13 monster kernel: [87309.624780] ata6.00: irq_stat 0x40000008
Feb 22 07:47:13 monster kernel: [87309.624789] ata6.00: failed command: READ FPDMA QUEUED
Feb 22 07:47:13 monster kernel: [87309.624794] ata6.00: cmd 60/08:68:30:ea:08/00:00:56:00:00/40 tag 13 ncq dma 4096 in
Feb 22 07:47:13 monster kernel: [87309.624794]          res 41/40:00:34:ea:08/00:00:56:00:00/00 Emask 0x409 (media error) <F>
Feb 22 07:47:13 monster kernel: [87309.624815] ata6.00: status: { DRDY ERR }
Feb 22 07:47:13 monster kernel: [87309.624821] ata6.00: error: { UNC }
Feb 22 07:47:13 monster systemd-tmpfiles[221685]: Failed to read '/usr/lib/tmpfiles.d/spice-vdagentd.conf': Input/output error
Feb 22 07:47:13 monster kernel: [87309.631644] ata6.00: configured for UDMA/133
Feb 22 07:47:13 monster kernel: [87309.631671] sd 5:0:0:0: [sdb] tag#13 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=0s
Feb 22 07:47:13 monster kernel: [87309.631679] sd 5:0:0:0: [sdb] tag#13 Sense Key : Medium Error [current]
Feb 22 07:47:13 monster kernel: [87309.631686] sd 5:0:0:0: [sdb] tag#13 Add. Sense: Unrecovered read error - auto reallocate failed
Feb 22 07:47:13 monster kernel: [87309.631692] sd 5:0:0:0: [sdb] tag#13 CDB: Read(10) 28 00 56 08 ea 30 00 00 08 00
Feb 22 07:47:13 monster kernel: [87309.631696] blk_update_request: I/O error, dev sdb, sector 1443424820 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Feb 22 07:47:13 monster kernel: [87309.631727] ata6: EH complete
In this case, the tool that failed reading a file (systemd-tmpfiles) tells us the name of the failing file. But in most cases, that does not happen, or the filename is logged somewhere else (maybe a journal, maybe another /var/log/... file).
To find the file, though, we can search for it using the debugfs tool. But first we have to determine a few parameters. The following are the steps I followed.
Source 1: Page on superuser with the question: sector → block
Source 2: Smartmontools wiki on How to deal with Bad Blocks
Here are the steps to follow in order to use the tool and fix a bad block error:
The error shown above has this one line:
... blk_update_request: I/O error, dev sdb, sector 1443424820 ...
This gives us the device name: sdb
And the sector: 1443424820
Remember these two parameters.
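If you do not want to scroll through the whole log, you can search for those lines directly. This is a minimal sketch, assuming the Ubuntu log location used above (the journalctl variant is for systems where the kernel messages only go to the journal):

$ sudo grep 'blk_update_request: I/O error' /var/log/syslog
$ sudo journalctl -k | grep 'I/O error'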
The sector number above can be matched against the partition start & end sector numbers. This information can be retrieved using the fdisk tool like so:
$ sudo fdisk -l /dev/sdb
Disk /dev/sdb: 931.52 GiB, 1000207286272 bytes, 1953529856 sectors
Disk model: SanDisk SSD PLUS
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 50ABACAC-1346-4193-8F84-68C2C98AF9B0

Device         Start        End    Sectors   Size Type
/dev/sdb1       2048       4095       2048     1M BIOS boot
/dev/sdb2       4096  209719295  209715200   100G Linux filesystem
/dev/sdb3  209719296 1953527807 1743808512 831.5G Linux filesystem
This command gives us a few interesting bits of information:
The total disk size: 931.52 GiB (about 1 TB)
The model of the disk: SanDisk SSD PLUS
And the size of one sector: 512 bytes
Finally, we see the list of partitions with the Start and End sector number. As mentioned above, we're looking to determine where the failing sector is from so we have to search that list for sector 1443424820:
209,719,296 <= 1,443,424,820 <= 1,953,527,807
(I added a few commas to make it easier to read the numbers)
So that means the failing block is on: /dev/sdb3
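If you would rather let the shell do that range check, a small awk one-liner over the fdisk output can print the matching partition. This is just a sketch; it assumes a GPT layout like the one above, where the Start and End sectors are the second and third columns:

$ SECTOR=1443424820
$ sudo fdisk -l /dev/sdb | awk -v s="$SECTOR" '$1 ~ /^\/dev\// && $2+0 <= s+0 && s+0 <= $3+0 { print $1 }'
/dev/sdb3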
Now, sectors on a disk usually have a size of 512 bytes. At least, all the disks I've seen on PCs so far have had such sectors, and we've determined that this is the case in the previous step.
However, with modern operating systems, it is most common to format a drive using blocks of 4,096 bytes. Still, we want to be sure about that parameter. We can check it with a simple tool such as tune2fs (note that it must be run against the partition holding the file system, not the whole disk):
sudo tune2fs -l /dev/sdb3
This outputs many parameters. What we need is the Block Size line:
Block size: 4096
If you find it easier, you can pipe the output to less and search in there, or grep for "Block" to get only the lines with the word "Block" in them. Personally, I like to have all the info at hand, just in case.
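For example, this prints only the line we care about here:

$ sudo tune2fs -l /dev/sdb3 | grep -i 'block size'
Block size:               4096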
The ATA drive, when it gets an error, spits out the sector number. However, to deal with the file system on a partition, you must use a block number. The following is the formula to convert one to the other using expr:
$ expr <sector-from-syslog> - <start-sector-from-fdisk>
<result1>
$ expr <result1> \* <sector-size-from-fdisk> / <block-size-from-tune2fs>
<result2>
<result2> is our block number.
With our numbers above, we would be doing:
$ expr 1443424820 - 209719296
1233705524
$ expr 1233705524 \* 512 / 4096
154213190
So our block number is: 154213190
Note: expr makes its computations using integer math, as C or C++ would. So the last number is always going to be an integer. If you use a calculator, just ignore the decimal part. That decimal part defines which sector is affected within that block; unfortunately, our file system is likely to lose the whole block.
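If you prefer, the shell's built-in integer arithmetic gives the same result in one step, and it truncates exactly like expr does. With our numbers:

$ echo $(( (1443424820 - 209719296) * 512 / 4096 ))
154213190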
Now that we have the block number, we can query the file system for the file name that corresponds to the affected file.
This is done using the debugfs tool like so:
Case 1 — Block not in use, no filename to find
$ sudo debugfs
debugfs:  open /dev/sdb3
debugfs:  testb 154213190
Block 154213190 not in use
debugfs:  q
In this case, the file system is not using the block anymore, so we can attempt an overwrite with the dd command (shown further below) immediately, and the drive should allocate a new block to fix the issue. This is the best case scenario, but rather unlikely unless the file was a temporary file (maybe a cached file from a browser).
Case 2 — Block is in use, find filename
$ sudo debugfs
debugfs:  open /dev/sdb3
debugfs:  testb 154213190
Block 154213190 marked in use
debugfs:  icheck 154213190
Block       Inode number
180428102   45090663
debugfs:  ncheck 45090663
Inode       Pathname
45090663    /home/alexis/.cache/mozilla-hidden/firefox/bw3kpb6b.default-release/cache2/entries/4A0210FA32D6907D3BFAFC1DEE42E633FE6A7C33
debugfs:  q
Note: In Case 1, the icheck command could be used without the testb command; it would simply tell you that there is no inode attached to that block. However, from what I understand, testb is faster.
As we can see, we use the icheck command to transform the block number into an inode number.
From that inode number, we can find the filename using the ncheck command. Now that we know which file is affected, we can decide how to deal with it.
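As a side note, debugfs can also run a single command non-interactively with its -R option, which is handy if you want to script these lookups. Using the same block and inode numbers as above:

$ sudo debugfs -R 'icheck 154213190' /dev/sdb3
$ sudo debugfs -R 'ncheck 45090663' /dev/sdb3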
Note that if there are hard links to that one file, then multiple filenames will be affected. I have not encountered that case yet, and the ncheck command is likely to give you the original file name. Either way, that should not matter in this case.
In my case, I could not fully boot back into my X-Windows session because the SVG library had a bad block. That file could not be loaded, so the whole thing would break.
Since that was an operating system file, I could just re-install it.
First, I had to find out which package it was a part of:
$ dpkg -S /usr/lib/x86_64-linux-gnu/librsvg-2.so.2.48.0
librsvg2-2:amd64: /usr/lib/x86_64-linux-gnu/librsvg-2.so.2.48.0
Second, I verified the exact name of the package, since I suspected that the ":amd64" part was not necessary:
$ dpkg -l librsvg2-2
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name             Version                Architecture Description
+++-================-======================-============-==================================================
ii  librsvg2-2:amd64 2.52.5+dfsg-3ubuntu0.2 amd64        SAX-based renderer library for SVG files (runtime)
That worked, so the name is librsvg2-2.
Third, I used apt-get as follows:
$ sudo apt-get install --reinstall librsvg2-2
and that worked. It did not bump into any other failing sector while reinstalling. Finally, I could finish booting to X-Windows. (I guess I did not mention that I did all of this through ssh from another computer; I keep that backdoor just in case, and it was working beautifully.)
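If you want to double-check that the new copy of the file reads back cleanly, forcing a full read of it is enough; any command that touches every block will do:

$ cat /usr/lib/x86_64-linux-gnu/librsvg-2.so.2.48.0 > /dev/null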
The re-install creates new files, which get written at new locations on the disk. I still have a lot of space (sadly, I used under 40% of that SSD drive!). Also, modern firmware (the "driver" within the drive itself) can allocate another sector when one is failing. So even if it were to write over that one block, it would still work.
Since the drive was really going bad, I later found that I could not start Firefox properly. Another system file was affected, only this time it was a directory. The trick above did not work on its own: it failed to fix the directory itself and the files inside.
Now, I think I was really lucky again: the failing directory was of the form:
/usr/share/doc/<library-name>
Documentation can safely be deleted until the next time you need it... and if I reinstall immediately, it comes right back anyway. So that's what I did. In my case, it was:
$ sudo rm -rf /usr/share/doc/libkmldom1
$ sudo apt-get install --reinstall libkmldom1
And then I could start Firefox.
I do not think this matters on the spot, but by overwriting the block, you make the subsequent READ commands work again, which can be useful in some situations. If you can just delete the file, that's also sufficient, because the next write will allocate another sector anyway (unless you have a really old drive, like from the 90s).
IMPORTANT: if the block is still in use (as determined by the debugfs command above), you are DESTROYING the file. Well, you cannot read it at the moment anyway, so it is already destroyed; but this is not fixing the file, it just gets you past the I/O read errors. After that, you'll get other errors, for example a bad JPEG image.
sudo dd if=/dev/zero of=/dev/sdb3 bs=4096 count=1 seek=154213190
The /dev/zero is used to clear that block with all zeroes.
The /dev/sdb3 is the output file. It happens to be a block device. The partition to be precise.
The bs=4096 is the size of the block as we determined with tune2fs above.
The count=1 is the number of blocks to overwrite. As you can see, we can't just overwrite that one flaky sector; we have to erase the whole block (4,096 bytes instead of just 512). There may be ways around that, but I don't think you can read the remaining good sectors of that block without direct access to the drive, which is not easy.
The seek=154213190 is the block number we calculated above. Make sure NOT to use the sector number, or you would overwrite the wrong block (assuming such a block even exists on that partition, which, if you make that mistake, is quite likely).
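Before (and after) running that dd command, you may want to confirm the read behavior of that exact block. This read-only sketch uses skip instead of seek to read at the same offset; iflag=direct bypasses the page cache so the read really hits the drive. It should fail with an I/O error before the overwrite and succeed after:

$ sudo dd if=/dev/sdb3 of=/dev/null bs=4096 count=1 skip=154213190 iflag=direct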
I think that for browser temporary files, you're pretty safe doing this. You can always reload a page later if it fails.
One question I had was: Why would an SSD drive fail like that?
Well... down to earth, the reality is that SSD drives are actually pretty short-lived, even if many people think they can last 10+ years (according to many comments on the Internet...). The truth is that they live a much shorter life, closer to 5 years. Mine survived for about 5 ½ years, which is actually pretty good for cheap drives.
One thing you can look at is the manufacturer warranty. If it is 5+ years, the drive is likely to live a long time. With 3+ years, like mine, it may live around 5 years. If the manufacturer warranty is 1 year, I'd suggest not buying that drive unless you won't need it for long, because it's likely to fail in under 5 years.
I guess I'll continue to use HDDs because those, in general, last much longer and they offer way more space for the same amount of money.
SSDs have their place: silent laptops, fast data centers and, of course, smartphones... but HDDs are not dinosaurs yet.
One reason for an SSD to go bad faster is the amount of writes that happen to it. The drive keeps track of this, and you can see the info with various commands such as the smartctl tool.
Let's try it:
$ sudo smartctl -a /dev/sdb
Now search the parameters for one that makes sense... (This varies between manufacturers.) One I can see for my drive:
233 Total_NAND_Writes_GiB 0x0032 100 100 --- Old_age Always - 5604
So I've only written about 5.5 TiB to the drive.
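Since attribute names vary between manufacturers, a case-insensitive grep over the attribute table can help find the write-related ones. A rough sketch; the pattern is a guess that happens to match my drive's attribute:

$ sudo smartctl -A /dev/sdb | grep -i -E 'write|wear'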
The tune2fs command we ran above also has this line:
Lifetime writes: 33 TB
That is the amount of data written to this file system since it was created, as tracked by ext4; it is not an endurance limit. The drive's actual endurance limit is the TBW rating published by the manufacturer, and some old drives would even prevent further writes once that limit was reached.
As I mentioned above, most of the files affected were OS files, but I also found some Firefox cache files. That drive was set up to be the OS / boot drive (i.e. the root directory, or "/"). So I expected some writes, especially to /etc, but not that much otherwise (just when upgrading the OS every few days, which is still very low volume).
In fact, there is a "new" folder to take into account now: /snap. I knew of /home and /var and /tmp, but I did not think of that /snap folder. Oops. That's why I had some errors from Firefox as well; I use Firefox a lot, every day.
That being said, if I assume the drive can take something like 33 TB of writes in total, then at my rate of about 1 TB per year I should be able to use it for another 5.5 × 6 years, a total of about 30 years. I'll replace that boot drive anyway, along with the other SanDisk (same model, and it shows about the same number of errors), to be safe, but I'll just keep the drives for data which I don't care as much about.
Another good command for additional info about a device is the following:
$ sudo smartctl -l devstat /dev/sdb
In particular, it shows the total number of writes, even the ones that did not make it to the disk itself (i.e. data that went to a cache and got overwritten before it was written to the NAND cells).
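Drives that support the Device Statistics log usually report that as a "Logical Sectors Written" entry; a grep narrows it down (the exact field names depend on the drive):

$ sudo smartctl -l devstat /dev/sdb | grep -i 'sectors written'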