e2fsck extremly slow, although enough memory exists
Posted
by
kaefert
on Server Fault
See other posts from Server Fault
or by kaefert
Published on 2012-11-06T19:53:12Z
Indexed on
2012/11/06
23:04 UTC
Read the original article
Hit count: 226
I've got this external USB-Disk:
kaefert@blechmobil:~$ lsusb -s 2:3
Bus 002 Device 003: ID 0bc2:3320 Seagate RSS LLC
As can be seen in this dmesg output, there are some problems that prevents that disk from beeing mounted:
kaefert@blechmobil:~$ dmesg | grep sdb
[ 114.474342] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
[ 114.475089] sd 5:0:0:0: [sdb] Write Protect is off
[ 114.475092] sd 5:0:0:0: [sdb] Mode Sense: 43 00 00 00
[ 114.475959] sd 5:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[ 114.477093] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
[ 114.501649] sdb: sdb1
[ 114.502717] sd 5:0:0:0: [sdb] 732566645 4096-byte logical blocks: (3.00 TB/2.72 TiB)
[ 114.504354] sd 5:0:0:0: [sdb] Attached SCSI disk
[ 116.804408] EXT4-fs (sdb1): ext4_check_descriptors: Checksum for group 3976 failed (47397!=61519)
[ 116.804413] EXT4-fs (sdb1): group descriptors corrupted!
So I went and fired up my favorite partition manager - gparted, and told it to verify and repair the partition sdb1. This made gparted call e2fsck (version 1.42.4 (12-Jun-2012))
e2fsck -f -y -v /dev/sdb1
Although gparted called e2fsck with the "-v" option, sadly it doesn't show me the output of my e2fsck process (bugreport https://bugzilla.gnome.org/show_bug.cgi?id=467925 )
I started this whole thing on Sunday (2012-11-04_2200) evening, so about 48 hours ago, this is what htop says about it now (2012-11-06-1900):
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
3704 root 39 19 1560M 1166M 768 R 98.0 19.5 42h56:43 e2fsck -f -y -v /dev/sdb1
Now I found a few posts on the internet that discuss e2fsck running slow, for example:
http://gparted-forum.surf4.info/viewtopic.php?id=13613
where they write that its a good idea to see if the disk is just that slow because maybe its damaged, and I think these outputs tell me that this is not the case in my case:
kaefert@blechmobil:~$ sudo hdparm -tT /dev/sdb
/dev/sdb:
Timing cached reads: 3562 MB in 2.00 seconds = 1783.29 MB/sec
Timing buffered disk reads: 82 MB in 3.01 seconds = 27.26 MB/sec
kaefert@blechmobil:~$ sudo hdparm /dev/sdb
/dev/sdb:
multcount = 0 (off)
readonly = 0 (off)
readahead = 256 (on)
geometry = 364801/255/63, sectors = 5860533160, start = 0
However, although I can read quickly from that disk, this disk speed doesn't seem to be used by e2fsck, considering tools like gkrellm or iotop or this:
kaefert@blechmobil:~$ iostat -x
Linux 3.2.0-2-amd64 (blechmobil) 2012-11-06 _x86_64_ (2 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
14,24 47,81 14,63 0,95 0,00 22,37
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0,59 8,29 2,42 5,14 43,17 160,17 53,75 0,30 39,80 8,72 54,42 3,95 2,99
sdb 137,54 5,48 9,23 0,20 587,07 22,73 129,35 0,07 7,70 7,51 16,18 2,17 2,04
Now I researched a little bit on how to find out what e2fsck is doing with all that processor time, and I found the tool strace, which gives me this:
kaefert@blechmobil:~$ sudo strace -p3704
lseek(4, 41026998272, SEEK_SET) = 41026998272
write(4, "\212\354K[_\361\3nl\212\245\352\255jR\303\354\312Yv\334p\253r\217\265\3567\325\257\3766"..., 4096) = 4096
lseek(4, 48404766720, SEEK_SET) = 48404766720
read(4, "\7t\260\366\346\337\304\210\33\267j\35\377'\31f\372\252\ffU\317.y\211\360\36\240c\30`\34"..., 4096) = 4096
lseek(4, 41027002368, SEEK_SET) = 41027002368
write(4, "\232]7Ws\321\352\t\1@[+5\263\334\276{\343zZx\352\21\316`1\271[\202\350R`"..., 4096) = 4096
lseek(4, 48404770816, SEEK_SET) = 48404770816
read(4, "\17\362r\230\327\25\346//\210H\v\311\3237\323K\304\306\361a\223\311\324\272?\213\tq \370\24"..., 4096) = 4096
lseek(4, 41027006464, SEEK_SET) = 41027006464
write(4, "\367yy>x\216?=\324Z\305\351\376&\25\244\210\271\22\306}\276\237\370(\214\205G\262\360\257#"..., 4096) = 4096
lseek(4, 48404774912, SEEK_SET) = 48404774912
read(4, "\365\25\0\21|T\0\21}3t_\272\373\222k\r\177\303\1\201\261\221$\261B\232\3142\21U\316"..., 4096) = 4096
^CProcess 3704 detached
around 16 of these lines every second, so 4 read and 4 write operations every second, which I don't consider to be a lot..
And finally, my question: Will this process ever finish? If those numbers from fseek (48404774912) represent bytes, that would be something like 45 gigabytes, with this beeing a 3 terrabyte disk, which would give me 134 days to go, if the speed stays constant, and he scans the disk like this completly and only once.
Do you have some advice for me? I have most of the data on that disk elsewhere, but I've put a lot of hours into sorting and merging it to this disk, so I would prefer to getting this disk up and running again, without formatting it anew. I don't think that the hardware is damaged since the disk is only a few months and since I can't see any I/O errors in the dmesg output.
UPDATE: I just looked at the strace output again (2012-11-06_2300), now it looks like this:
lseek(4, 1419860611072, SEEK_SET) = 1419860611072
read(4, "3#\f\2447\335\0\22A\355\374\276j\204'\207|\217V|\23\245[\7VP\251\242\276\207\317:"..., 4096) = 4096
lseek(4, 43018145792, SEEK_SET) = 43018145792
write(4, "]\206\231\342Y\204-2I\362\242\344\6R\205\361\324\177\265\317C\334V\324\260\334\275t=\10F."..., 4096) = 4096
lseek(4, 1419860615168, SEEK_SET) = 1419860615168
read(4, "\262\305\314Y\367\37x\326\245\226\226\320N\333$s\34\204\311\222\7\315\236\336\300TK\337\264\236\211n"..., 4096) = 4096
lseek(4, 43018149888, SEEK_SET) = 43018149888
write(4, "\271\224m\311\224\25!I\376\16;\377\0\223H\25Yd\201Y\342\r\203\271\24eG<\202{\373V"..., 4096) = 4096
lseek(4, 1419860619264, SEEK_SET) = 1419860619264
read(4, ";d\360\177\n\346\253\210\222|\250\352T\335M\33\260\320\261\7g\222P\344H?t\240\20\2548\310"..., 4096) = 4096
lseek(4, 43018153984, SEEK_SET) = 43018153984
write(4, "\360\252j\317\310\251G\227\335{\214`\341\267\31Y\202\360\v\374\307oq\3063\217Z\223\313\36D\211"..., 4096) = 4096
So this number of the lseeks before the reads, like 1419860619264 are already a lot bigger, standing for 1.29 terabytes if the numbers are bytes, so it doesn't seem to be a linear progress on a big scale, maybe there are only some areas that need work, that have big gaps in between them.
(times are in CET)
© Server Fault or respective owner