Search Results

Search found 75 results on 3 pages for 'smartctl'.


  • Confirm disk is broken when it passes all diagnostics

    - by Halfgaar
    I have a system with a potentially broken disk, but the disk passes all manner of diagnostics. I have been unable to confirm that the disk is broken. What are my options? I could just replace the disk, but because this situation is very similar to another, more severe situation I have (long story), I'd like to make a proper diagnosis rather than randomly binning hardware. The issue and history are as follows: I had a Debian Linux PC (500 MHz P3) acting as router, Nagios and Munin host. It crashed every couple of weeks. No logs or dmesg output could be obtained (because it's an old Compaq that only boots when you configure it as keyboardless, which makes connecting a keyboard later, once it's booted, impossible). At the time, I just replaced the computer with another Compaq (P4 2.4 GHz) because I thought the hardware was faulty. However, it still crashed every couple of weeks. The difference is that on this computer, I can still SSH into it. It gives all kinds of errors on hda. I'd like to confirm that the disk is broken, but nothing I do confirms this: The SMART error log shows no errors. Normally when a disk starts acting up, SMART may pass, but it still records a read error in the error log. The SMART self-test (smartctl -t long /dev/sda) completes without errors. The re-allocated sector count (a tell-tale parameter) has been 31 all its life, even when the disk was still in use in my desktop PC years ago. The figure has never changed. dd if=/dev/sda of=/dev/null bs=4096 passes with flying colors. What else can I do to assess the health of the drive? Again, this is not about making this router fully functional again; this is a disk forensics question, because it just so happens that I have another server that potentially has the same problem, and knowing the answer to this will possibly help me greatly. For the record, below are logs and such. 
This is the smartctl -a output: smartctl 5.40 2010-07-12 r3124 [i686-pc-linux-gnu] (local build) Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family Device Model: ST3120026A Serial Number: 5JT1CLQM Firmware Version: 3.06 User Capacity: 120,034,123,776 bytes Device is: In smartctl database [for details use: -P show] ATA Version is: 6 ATA Standard is: ATA/ATAPI-6 T13 1410D revision 2 Local Time is: Mon Jul 1 21:18:33 2013 CEST SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. Auto Offline Data Collection: Enabled. Self-test execution status: ( 24) The self-test routine was aborted by the host. Total time to complete Offline data collection: ( 430) seconds. Offline data collection capabilities: (0x5b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. No Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. No General Purpose Logging support. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 85) minutes. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 050 046 006 Pre-fail Always - 47766662 3 Spin_Up_Time 0x0003 097 096 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 10 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 31 7 Seek_Error_Rate 0x000f 084 060 030 Pre-fail Always - 820305 9 Power_On_Hours 0x0032 048 048 000 Old_age Always - 46373 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 605 194 Temperature_Celsius 0x0022 036 065 000 Old_age Always - 36 195 Hardware_ECC_Recovered 0x001a 050 046 000 Old_age Always - 47766662 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 196 000 Old_age Always - 6 200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0 202 Data_Address_Mark_Errs 0x0032 100 253 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Aborted by host 80% 46361 - # 2 Extended offline Completed without error 00% 46358 - # 3 Short offline Completed without error 00% 12046 - # 4 Extended offline Completed without error 00% 10472 - # 5 Short offline Completed without error 00% 10471 - # 6 Short offline Completed without error 00% 10471 - # 7 Short offline Completed without error 00% 6770 - # 8 Extended offline Aborted by host 90% 5958 - # 9 Extended offline Aborted by host 90% 5951 - #10 Short offline Completed without error 00% 5024 - #11 Extended offline Aborted by host 80% 5024 - #12 Short offline Completed without error 00% 3697 - #13 Short offline Completed without error 00% 237 - #14 Short offline Completed 
without error 00% 145 - #15 Short offline Completed without error 00% 69 - #16 Extended offline Completed without error 00% 68 - #17 Short offline Completed without error 00% 66 - #18 Short offline Completed without error 00% 49 - #19 Short offline Completed without error 00% 29 - #20 Short offline Completed without error 00% 29 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. And this is the dmesg error when it has crashed (which repeats for a bunch of different sectors): [1755091.211136] sd 0:0:0:0: [sda] Unhandled error code [1755091.211144] sd 0:0:0:0: [sda] Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK [1755091.211151] sd 0:0:0:0: [sda] CDB: Read(10): 28 00 08 fe ad 38 00 00 08 00 [1755091.211166] end_request: I/O error, dev sda, sector 150908216
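    The CDB bytes in the dmesg line above can be cross-checked against the sector the kernel reports. A minimal sketch, assuming the standard SCSI READ(10) layout (opcode 0x28, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8); the helper name decode_read10 is made up for illustration:

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a READ(10) CDB written as hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) command")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the dmesg output in the post
lba, blocks = decode_read10("28 00 08 fe ad 38 00 00 08 00")
print(lba, blocks)  # 150908216 8
```

    The decoded LBA matches the sector in the end_request line, so the failing command was an 8-block read starting at exactly the sector the kernel complains about.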


  • Sporadic disk clicking sound

    - by Abdó
    Hi, I'm having some unusual and sporadic hard disk clicking issues. Here is a chronological description of the facts. I'm using an ASUS P6T-SE with an Intel Core i7, 6 GB RAM, a 600 W power supply and ATI 4670 graphics, running Ubuntu 10.10. About one month ago my hard disk (SATA II Seagate Barracuda 1 TB 7200 rpm) started making a clicking sound: a sort of loud tic-tac, every second or so, during disk activity. The system was clearly slower than before at disk access, but it was functional and I could not find any sign of trouble in the Linux logs. I disconnected the disk and tried an older SATA drive I had around: no problem with it. Then I reconnected the Seagate disk, and the problem was mysteriously gone. Ubuntu booted normally, at the usual speed, with no clicking. A couple of weeks later, the problem reappeared. I tried disconnecting and reconnecting (as it somehow solved the problem before) without luck. So, although it was a rather new drive, I assumed it was a hardware issue, made backups and bought a new drive. The new drive is a SATA II Seagate Barracuda 1.5 TB 7200 rpm. I installed both drives at the same time, with the intention of transferring my files from one to the other. To my surprise, when I booted the computer with both drives, both started making the clicking sound! Even worse, I removed the old drive, leaving the unformatted new drive connected, and booted from a LiveCD. It kept clicking! Puzzled by this, I tried both drives on my laptop with a SATA-to-USB cable. As soon as I connected either of them, it made one or two unusual clicks, then immediately stopped and worked normally. The old drive, which I thought was almost dead, was working like a charm as if nothing had happened. Then I thought: "OK, it must be the motherboard. Let's try again". So I reconnected the old drive to the ASUS P6T motherboard (the same cables and SATA port as before), and it worked as if nothing had happened! The problem was gone again. 
    The new 1.5 TB drive was also working OK: no clicking or slowdown. So I left the old 1 TB disk connected and kept using the computer daily for 3 weeks, until today it happened again. Now I don't really know what to do or check. I'm not even sure it is a hardware issue any more! This is rather annoying, as it seems to happen every 2 or 3 weeks and I have no means of forcing it to happen. Does anyone have a clue what can cause this behaviour, or any suggestions for things I should check when it happens again? What I did today was check some SMART parameters:

      - Error log: smartctl -l error /dev/sda. No errors
      - Short self-test: smartctl -t short /dev/sda. No errors
      - Disk health check: smartctl -H /dev/sda. Passed

    And here are the vendor-specific parameters (smartctl -A /dev/sda), which I'm not quite sure how to interpret:

    === START OF READ SMART DATA SECTION ===
    SMART Attributes Data Structure revision number: 10
    Vendor Specific SMART Attributes with Thresholds:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       235962588
      3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0
      4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       187
      5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000f   072   060   030    Pre-fail  Always       -       16348045
      9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       3590
     10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       94
    183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
    184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
    187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
    188 Command_Timeout         0x0032   100   097   000    Old_age   Always       -       4295164029
    189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
    190 Airflow_Temperature_Cel 0x0022   070   057   045    Old_age   Always       -       30 (Lifetime Min/Max 19/31)
    194 Temperature_Celsius     0x0022   030   043   000    Old_age   Always       -       30 (0 18 0 0)
    195 Hardware_ECC_Recovered  0x001a   037   026   000    Old_age   Always       -       235962588
    197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       73950746906346
    241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       1832967731
    242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3294986902

    Any clue to this mystery will be really welcome. Thank you very much!
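    The very large Command_Timeout raw value (4295164029) in the table above looks less alarming once it is split up. A minimal sketch, assuming the drive packs three 16-bit counters into the 48-bit raw field (an assumption about this firmware; the meaning of each word is vendor-specific and not stated in the post). The name split_raw16 is hypothetical:

```python
def split_raw16(raw: int) -> tuple[int, int, int]:
    """Split a 48-bit SMART raw value into three 16-bit words, high to low."""
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# Command_Timeout (attribute 188) raw value from the post
print(split_raw16(4295164029))  # (1, 3, 125)
```

    Read this way, the attribute holds three small counters rather than four billion timeouts, although which counter means what would need the vendor's documentation.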


  • SATA errors reported during boot: exception Emask 0x40 SAct 0x0 SErr 0x80800 action 0x0

    - by digby280
    I have noticed some errors during the Linux boot. They continue to occur after boot, adding lines to the log every few seconds. Once booted, this does not normally appear to cause any problems. However, around 1 in 10 boots results in a kernel panic, and on two or three occasions the computer has suddenly rebooted after being powered on for a number of hours. I presume the cause of the reboots is a kernel panic as well. I am running Ubuntu 11.10 and have had Ubuntu installed on the computer for around a year. I have googled around and not found anything useful. I have provided the kernel log lines and the output of smartctl. Can anyone explain exactly what these errors mean or, better still, how to resolve them? Apr 2 16:51:27 dell580 kernel: [ 19.831140] EXT4-fs (sdb2): re-mounted. Opts: errors=remount-ro,user_xattr,commit=0 Apr 2 16:51:27 dell580 kernel: [ 19.934194] tg3 0000:03:00.0: eth0: Link is down Apr 2 16:51:28 dell580 kernel: [ 20.929468] tg3 0000:03:00.0: eth0: Link is up at 100 Mbps, full duplex Apr 2 16:51:28 dell580 kernel: [ 20.929471] tg3 0000:03:00.0: eth0: Flow control is on for TX and on for RX Apr 2 16:51:28 dell580 kernel: [ 20.929727] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready Apr 2 16:51:29 dell580 kernel: [ 21.609381] EXT4-fs (sdb2): re-mounted. 
Opts: errors=remount-ro,user_xattr,commit=0 Apr 2 16:51:29 dell580 kernel: [ 21.616515] ata2.01: exception Emask 0x40 SAct 0x0 SErr 0x80800 action 0x0 Apr 2 16:51:29 dell580 kernel: [ 21.616519] ata2.01: SError: { HostInt 10B8B } Apr 2 16:51:29 dell580 kernel: [ 21.616525] ata2.00: hard resetting link Apr 2 16:51:29 dell580 kernel: [ 21.934036] ata2.01: hard resetting link Apr 2 16:51:29 dell580 kernel: [ 22.408890] ata2.00: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Apr 2 16:51:29 dell580 kernel: [ 22.408907] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 2 16:51:29 dell580 kernel: [ 22.440934] ata2.00: configured for UDMA/100 Apr 2 16:51:29 dell580 kernel: [ 22.449040] ata2.01: configured for UDMA/133 Apr 2 16:51:29 dell580 kernel: [ 22.449818] ata2: EH complete Apr 2 16:51:33 dell580 kernel: [ 26.122664] ata2.01: exception Emask 0x40 SAct 0x0 SErr 0x80800 action 0x0 Apr 2 16:51:33 dell580 kernel: [ 26.122670] ata2.01: SError: { HostInt 10B8B } Apr 2 16:51:33 dell580 kernel: [ 26.122677] ata2.00: hard resetting link Apr 2 16:51:33 dell580 kernel: [ 26.442684] ata2.01: hard resetting link Apr 2 16:51:34 dell580 kernel: [ 26.925545] ata2.00: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Apr 2 16:51:34 dell580 kernel: [ 26.925561] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 2 16:51:34 dell580 kernel: [ 26.961542] ata2.00: configured for UDMA/100 Apr 2 16:51:34 dell580 kernel: [ 26.969616] ata2.01: configured for UDMA/133 Apr 2 16:51:34 dell580 kernel: [ 26.970400] ata2: EH complete Apr 2 16:51:35 dell580 kernel: [ 28.111180] ata2.01: exception Emask 0x40 SAct 0x0 SErr 0x80800 action 0x0 Apr 2 16:51:35 dell580 kernel: [ 28.111184] ata2.01: SError: { HostInt 10B8B } Apr 2 16:51:35 dell580 kernel: [ 28.111191] ata2.00: hard resetting link Apr 2 16:51:35 dell580 kernel: [ 28.429674] ata2.01: hard resetting link Apr 2 16:51:36 dell580 kernel: [ 28.904557] ata2.00: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Apr 2 16:51:36 
dell580 kernel: [ 28.904572] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 2 16:51:36 dell580 kernel: [ 28.936609] ata2.00: configured for UDMA/100 Apr 2 16:51:36 dell580 kernel: [ 28.944692] ata2.01: configured for UDMA/133 Apr 2 16:51:36 dell580 kernel: [ 28.945464] ata2: EH complete Apr 2 16:51:38 dell580 kernel: [ 31.581756] eth0: no IPv6 routers present Apr 2 16:51:38 dell580 kernel: [ 32.103066] ata2.01: exception Emask 0x40 SAct 0x0 SErr 0x80800 action 0x0 Apr 2 16:51:38 dell580 kernel: [ 32.103074] ata2.01: SError: { HostInt 10B8B } Apr 2 16:51:38 dell580 kernel: [ 32.103085] ata2.00: hard resetting link Apr 2 16:51:38 dell580 kernel: [ 32.419669] ata2.01: hard resetting link Apr 2 16:51:39 dell580 kernel: [ 32.894518] ata2.00: SATA link up 1.5 Gbps (SStatus 113 SControl 300) Apr 2 16:51:39 dell580 kernel: [ 32.894533] ata2.01: SATA link up 3.0 Gbps (SStatus 123 SControl 300) Apr 2 16:51:39 dell580 kernel: [ 32.926536] ata2.00: configured for UDMA/100 Apr 2 16:51:39 dell580 kernel: [ 32.934715] ata2.01: configured for UDMA/133 Apr 2 16:51:39 dell580 kernel: [ 32.935578] ata2: EH complete Here's the output of smartctl for the drive. smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.0.0-17-generic] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: SAMSUNG SpinPoint F1 DT Device Model: SAMSUNG HD103UJ Serial Number: S13PJ90QC19706 LU WWN Device Id: 5 0000f0 00b1c7960 Firmware Version: 1AA01113 User Capacity: 1,000,204,886,016 bytes [1.00 TB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 3b Local Time is: Mon Apr 2 17:13:48 2012 BST SMART support is: Available - device has SMART capability. 
SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED General SMART Values: Offline data collection status: (0x00) Offline data collection activity was never started. Auto Offline Data Collection: Disabled. Self-test execution status: ( 41) The self-test routine was interrupted by the host with a hard or soft reset. Total time to complete Offline data collection: (11772) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 2) minutes. Extended self-test routine recommended polling time: ( 197) minutes. Conveyance self-test routine recommended polling time: ( 21) minutes. SCT capabilities: (0x003f) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0007 076 076 011 Pre-fail Always - 7940 4 Start_Stop_Count 0x0032 099 099 000 Old_age Always - 521 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail Always - 0 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail Offline - 0 9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 642 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail Always - 0 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 482 13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age Always - 0 183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 759 184 End-to-End_Error 0x0033 100 100 000 Pre-fail Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 073 069 000 Old_age Always - 27 (Min/Max 16/27) 194 Temperature_Celsius 0x0022 073 067 000 Old_age Always - 27 (Min/Max 16/28) 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age Always - 320028 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 099 099 000 Old_age Always - 1494 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age Always - 0 201 Soft_Read_Error_Rate 0x000a 253 253 000 Old_age Always - 0 SMART Error Log Version: 1 ATA Error Count: 211 (device log contains only the most recent five errors) CR = Command Register [HEX] FR = Features Register [HEX] SC = Sector Count Register [HEX] SN = Sector Number Register [HEX] CL = Cylinder Low Register [HEX] CH = Cylinder High Register 
[HEX] DH = Device/Head Register [HEX] DC = Device Command Register [HEX] ER = Error register [HEX] ST = Status register [HEX] Powered_Up_Time is measured from power on, and printed as DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes, SS=sec, and sss=millisec. It "wraps" after 49.710 days. Error 211 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 0f 31 63 8f e1 Error: ICRC, ABRT 15 sectors at LBA = 0x018f6331 = 26174257 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 00 40 62 8f e1 08 00:01:00.460 READ DMA c8 00 20 00 7c 30 e0 08 00:01:00.450 READ DMA c8 00 00 10 49 8f e1 08 00:01:00.440 READ DMA c8 00 e0 20 d0 30 e0 08 00:01:00.420 READ DMA c8 00 00 c0 59 90 e1 08 00:01:00.400 READ DMA Error 210 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 cf e9 cf 66 e0 Error: ICRC, ABRT 207 sectors at LBA = 0x0066cfe9 = 6737897 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 00 b8 cf 66 e0 08 00:08:29.780 READ DMA c8 00 60 60 c9 18 e0 08 00:08:29.770 READ DMA c8 00 40 20 c9 18 e0 08 00:08:29.770 READ DMA c8 00 20 00 c9 18 e0 08 00:08:29.760 READ DMA c8 00 20 98 cf 66 e0 08 00:08:29.750 READ DMA Error 209 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 2f d1 74 e0 e0 Error: ICRC, ABRT 47 sectors at LBA = 0x00e074d1 = 14709969 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 00 00 74 e0 e0 08 00:00:30.940 READ DMA c8 00 20 18 36 de e0 08 00:00:30.930 READ DMA c8 00 08 48 f1 dd e0 08 00:00:30.930 READ DMA c8 00 08 a8 f0 dd e0 08 00:00:30.930 READ DMA c8 00 08 90 f0 dd e0 08 00:00:30.930 READ DMA Error 208 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours) When the command that caused the error occurred, the device was active or idle. After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 84 51 7f 21 88 9d e0 Error: ICRC, ABRT 127 sectors at LBA = 0x009d8821 = 10324001 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- c8 00 a0 00 88 9d e0 08 00:00:27.610 READ DMA c8 00 58 a8 e7 9c e0 08 00:00:27.610 READ DMA c8 00 00 28 e6 9c e0 08 00:00:27.610 READ DMA c8 00 00 e0 e4 9c e0 08 00:00:27.610 READ DMA c8 00 00 90 e0 9c e0 08 00:00:27.600 READ DMA Error 207 occurred at disk power-on lifetime: 0 hours (0 days + 0 hours) When the command that caused the error occurred, the device was active or idle. 
After command completion occurred, registers were: ER ST SC SN CL CH DH -- -- -- -- -- -- -- 04 51 26 6a 6a c3 e0 Error: ABRT at LBA = 0x00c36a6a = 12806762 Commands leading to the command that caused the error were: CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name -- -- -- -- -- -- -- -- ---------------- -------------------- ca 00 00 90 69 c3 e0 08 00:29:39.350 WRITE DMA ca 00 40 90 68 c3 e0 08 00:29:39.350 WRITE DMA ca 00 40 50 65 c3 e0 08 00:29:39.350 WRITE DMA ca 00 40 d0 64 c3 e0 08 00:29:39.350 WRITE DMA ca 00 40 90 63 c3 e0 08 00:29:39.350 WRITE DMA SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Short offline Interrupted (host reset) 90% 638 - # 2 Short offline Interrupted (host reset) 90% 638 - # 3 Extended offline Interrupted (host reset) 90% 638 - # 4 Short offline Interrupted (host reset) 90% 638 - # 5 Extended offline Interrupted (host reset) 90% 638 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay.
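    The SMART error-log entries above can be decoded by hand from the register dump. A minimal sketch, assuming 28-bit LBA addressing (the low nibble of DH carries LBA bits 24-27) and the usual ATA error-register bit assignments for DMA commands; decode_error and ERROR_BITS are illustrative names, not part of smartctl:

```python
# Subset of ATA error-register bits relevant to the log in the post
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_error(er: int, sn: int, cl: int, ch: int, dh: int) -> tuple[list[str], int]:
    """Return (error flag names, 28-bit LBA) from the ATA task-file registers."""
    flags = [name for bit, name in ERROR_BITS.items() if er & bit]
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, lba


# Error 211: ER=84 SN=31 CL=63 CH=8f DH=e1
print(decode_error(0x84, 0x31, 0x63, 0x8F, 0xE1))  # (['ICRC', 'ABRT'], 26174257)
# Error 207: ER=04 SN=6a CL=6a CH=c3 DH=e0
print(decode_error(0x04, 0x6A, 0x6A, 0xC3, 0xE0))  # (['ABRT'], 12806762)
```

    ICRC is an interface CRC error, which generally points at the cable or link rather than the platters. That would be consistent with the UDMA_CRC_Error_Count raw value of 1494 in the attribute table, though it is a hint and not proof.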


  • How to use Secure Erase and is it on the install CD?

    - by Mikey
    Supposedly there is some built-in hard drive magic called "Secure Erase" which is wildly faster and more secure than "dd if=/dev/zero...". I am most excited about the speed increase. There seems to be a GUI for it as part of Parted Magic: http://www.ocztechnologyforum.com/forum/showthread.php?81321-Secure-Erase-With-bootable-CD-USB-Linux..-Point-and-Click-Method Is there something like this for Ubuntu? Better yet, is there a way to issue this command "manually", with smartctl or something similar?


  • Is my hard drive about to fail?

    - by Cody Harlow
    I sometimes hear squeaking noises when I use my computer, so I ran smartctl. These are the results:

    === START OF READ SMART DATA SECTION ===
    SMART Self-test log structure revision number 1
    Num  Test_Description    Status                     Remaining  LifeTime(hours)  LBA_of_first_error
    # 1  Short offline       Completed: read failure       90%      5953            37922655
    # 2  Extended offline    Completed: read failure       90%      5953            37922655
    # 3  Short offline       Completed: read failure       90%      5953            37922655
    # 4  Short offline       Completed without error       00%       429            -
    # 5  Extended offline    Aborted by host               90%       429            -
    # 6  Short offline       Completed without error       00%       429            -
    # 7  Short offline       Completed without error       00%       429            -

    Is this a bad sign?


  • laptop crashed: why?

    - by sds
    My Linux (Ubuntu 12.04) laptop crashed, and I am trying to figure out why.

    # last
    sds pts/4 :0 Tue Sep 4 10:01 still logged in
    sds pts/3 :0 Tue Sep 4 10:00 still logged in
    reboot system boot 3.2.0-29-generic Tue Sep 4 09:43 - 11:23 (01:40)
    sds pts/8 :0 Mon Sep 3 14:23 - crash (19:19)

    This seems to indicate a crash at 09:42 (= 14:23 + 19:19). As per another question, I looked at /var/log:

    auth.log:
    Sep 4 09:17:02 t520sds CRON[32744]: pam_unix(cron:session): session closed for user root
    Sep 4 09:43:17 t520sds lightdm: pam_unix(lightdm:session): session opened for user lightdm by (uid=0)

    There is no messages file.

    syslog:
    Sep 4 09:24:19 t520sds kernel: [219104.819975] CPU0: Package power limit normal
    Sep 4 09:43:16 t520sds kernel: imklog 5.8.6, log source = /proc/kmsg started.

    kern.log:
    Sep 4 09:24:19 t520sds kernel: [219104.819969] CPU1: Package power limit normal
    Sep 4 09:24:19 t520sds kernel: [219104.819971] CPU2: Package power limit normal
    Sep 4 09:24:19 t520sds kernel: [219104.819974] CPU3: Package power limit normal
    Sep 4 09:24:19 t520sds kernel: [219104.819975] CPU0: Package power limit normal
    Sep 4 09:43:16 t520sds kernel: imklog 5.8.6, log source = /proc/kmsg started.
    Sep 4 09:43:16 t520sds kernel: [ 0.000000] Initializing cgroup subsys cpuset
    Sep 4 09:43:16 t520sds kernel: [ 0.000000] Initializing cgroup subsys cpu

    I had a computation running until 9:24, but the system crashed 18 minutes later!
kern.log has many pages of these: Sep 4 09:43:16 t520sds kernel: [ 0.000000] total RAM covered: 8086M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 64K num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 128K num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 256K num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 512K num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 1M num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 2M num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 4M num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 8M num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 16M num_reg: 10 lose cover RAM: 38M Sep 4 09:43:16 t520sds kernel: [ 0.000000] *BAD*gran_size: 64K chunk_size: 32M num_reg: 10 lose cover RAM: -16M Sep 4 09:43:16 t520sds kernel: [ 0.000000] *BAD*gran_size: 64K chunk_size: 64M num_reg: 10 lose cover RAM: -16M Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 128M num_reg: 10 lose cover RAM: 0G Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 256M num_reg: 10 lose cover RAM: 0G Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 512M num_reg: 10 lose cover RAM: 0G Sep 4 09:43:16 t520sds kernel: [ 0.000000] gran_size: 64K chunk_size: 1G num_reg: 10 lose cover RAM: 0G Sep 4 09:43:16 t520sds kernel: [ 0.000000] *BAD*gran_size: 64K chunk_size: 2G num_reg: 10 lose cover RAM: -1G does this mean that my RAM is bad?! 
it also says Sep 4 09:43:16 t520sds kernel: [ 2.944123] EXT4-fs (sda1): INFO: recovery required on readonly filesystem Sep 4 09:43:16 t520sds kernel: [ 2.944126] EXT4-fs (sda1): write access will be enabled during recovery Sep 4 09:43:16 t520sds kernel: [ 3.088001] firewire_core: created device fw0: GUID f0def1ff8fbd7dff, S400 Sep 4 09:43:16 t520sds kernel: [ 8.929243] EXT4-fs (sda1): orphan cleanup on readonly fs Sep 4 09:43:16 t520sds kernel: [ 8.929249] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 658984 ... Sep 4 09:43:16 t520sds kernel: [ 9.343266] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 525343 Sep 4 09:43:16 t520sds kernel: [ 9.343270] EXT4-fs (sda1): 56 orphan inodes deleted Sep 4 09:43:16 t520sds kernel: [ 9.343271] EXT4-fs (sda1): recovery complete Sep 4 09:43:16 t520sds kernel: [ 9.645799] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null) does this mean my HD is bad? As per FaultyHardware, I tried smartctl -l selftest, which uncovered no errors: smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-30-generic] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net === START OF INFORMATION SECTION === Model Family: Seagate Momentus 7200.4 Device Model: ST9500420AS Serial Number: 5VJE81YK LU WWN Device Id: 5 000c50 0440defe3 Firmware Version: 0003LVM1 User Capacity: 500,107,862,016 bytes [500 GB] Sector Size: 512 bytes logical/physical Device is: In smartctl database [for details use: -P show] ATA Version is: 8 ATA Standard is: ATA-8-ACS revision 4 Local Time is: Mon Sep 10 16:40:04 2012 EDT SMART support is: Available - device has SMART capability. SMART support is: Enabled === START OF READ SMART DATA SECTION === SMART overall-health self-assessment test result: PASSED See vendor-specific Attribute list for marginal Attributes. General SMART Values: Offline data collection status: (0x82) Offline data collection activity was completed without error. 
Auto Offline Data Collection: Enabled. Self-test execution status: ( 0) The previous self-test routine completed without error or no self-test has ever been run. Total time to complete Offline data collection: ( 0) seconds. Offline data collection capabilities: (0x7b) SMART execute Offline immediate. Auto Offline data collection on/off support. Suspend Offline collection upon new command. Offline surface scan supported. Self-test supported. Conveyance Self-test supported. Selective Self-test supported. SMART capabilities: (0x0003) Saves SMART data before entering power-saving mode. Supports SMART auto save timer. Error logging capability: (0x01) Error logging supported. General Purpose Logging supported. Short self-test routine recommended polling time: ( 1) minutes. Extended self-test routine recommended polling time: ( 109) minutes. Conveyance self-test routine recommended polling time: ( 2) minutes. SCT capabilities: (0x103b) SCT Status supported. SCT Error Recovery Control supported. SCT Feature Control supported. SCT Data Table supported. 
SMART Attributes Data Structure revision number: 10 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 117 099 034 Pre-fail Always - 162843537 3 Spin_Up_Time 0x0003 100 100 000 Pre-fail Always - 0 4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 571 5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0 7 Seek_Error_Rate 0x000f 069 060 030 Pre-fail Always - 17210154023 9 Power_On_Hours 0x0032 095 095 000 Old_age Always - 174362787320258 10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0 12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 571 184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0 187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0 188 Command_Timeout 0x0032 100 100 000 Old_age Always - 1 189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0 190 Airflow_Temperature_Cel 0x0022 061 043 045 Old_age Always In_the_past 39 (0 11 44 26) 191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 84 192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 20 193 Load_Cycle_Count 0x0032 099 099 000 Old_age Always - 2434 194 Temperature_Celsius 0x0022 039 057 000 Old_age Always - 39 (0 15 0 0) 195 Hardware_ECC_Recovered 0x001a 041 041 000 Old_age Always - 162843537 196 Reallocated_Event_Count 0x000f 095 095 030 Pre-fail Always - 4540 (61955, 0) 197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0 199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0 SMART Error Log Version: 1 No Errors Logged SMART Self-test log structure revision number 1 Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error # 1 Extended offline Completed without error 00% 4545 - SMART Selective self-test log data structure revision number 1 SPAN MIN_LBA MAX_LBA 
CURRENT_TEST_STATUS 1 0 0 Not_testing 2 0 0 Not_testing 3 0 0 Not_testing 4 0 0 Not_testing 5 0 0 Not_testing Selective self-test flags (0x0): After scanning selected spans, do NOT read-scan remainder of disk. If Selective self-test is pending on power-up, resume after 0 minute delay. Googling for the messages proved inconclusive; I can't even figure out whether they are routine or catastrophic. So, what do I do now?
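The attribute table above is long, and the only row whose WHEN_FAILED column is not "-" is easy to miss. A minimal sketch of how such rows could be picked out, assuming the column layout that smartctl 5.41 prints in this post; the names ROW and flagged_attributes are made up for illustration and are not part of smartmontools:

```python
import re

# One row of the attribute table, e.g.
# "190 Airflow_Temperature_Cel 0x0022 061 043 045 Old_age Always In_the_past 39 (0 11 44 26)"
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+)$"
)

def flagged_attributes(lines):
    """Return (name, value, worst, thresh, when_failed) for rows that are
    marked as failed (now or in the past) or whose worst value reached
    a non-zero threshold."""
    hits = []
    for line in lines:
        m = ROW.match(line)
        if not m:
            continue  # header or unrelated line
        value, worst, thresh = (int(m[k]) for k in ("value", "worst", "thresh"))
        failed = m["when_failed"] != "-"
        at_threshold = thresh > 0 and worst <= thresh
        if failed or at_threshold:
            hits.append((m["name"], value, worst, thresh, m["when_failed"]))
    return hits

# Three rows copied from the output in the post
sample = [
    "5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0",
    "190 Airflow_Temperature_Cel 0x0022 061 043 045 Old_age Always In_the_past 39 (0 11 44 26)",
    "197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0",
]
for hit in flagged_attributes(sample):
    print(hit)
```

On these rows only Airflow_Temperature_Cel is reported, which looks like a past temperature excursion rather than a media error. That is a reading of the table, not a diagnosis.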


  • External USB drive is failing

    - by dma_k
    I have an external USB 2.0 drive, a WD My Book Mirror Edition, running in RAID 1 (mirroring) mode. A while ago the drive started to fail: it stops responding (directory listings return an error after a long timeout). Sometimes it works for weeks before a failure, sometimes only a few hours. Small write operations (like removing a few files or editing a small file) do no harm, but copying large files to the drive over the network, or creating an archive locally, produces kernel dumps. It is also interesting that once the kernel has failed, Linux will not reboot normally (reboot hangs); when the Linux box is shut down with the power button, the WD drive does not go into sleep mode as it usually does: the LEDs keep running, and pressing and holding the "shutdown" button on the drive's back panel does nothing; only unplugging the power cord helps. Here is the boot log: Aug 16 00:32:21 kernel: [ 1.514106] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver Aug 16 00:32:21 kernel: [ 1.657738] ehci_hcd 0000:00:1d.7: PCI INT A -> GSI 23 (level, low) -> IRQ 23 Aug 16 00:32:21 kernel: [ 1.673747] ehci_hcd 0000:00:1d.7: setting latency timer to 64 Aug 16 00:32:21 kernel: [ 1.673751] ehci_hcd 0000:00:1d.7: EHCI Host Controller Aug 16 00:32:21 kernel: [ 1.725224] ehci_hcd 0000:00:1d.7: new USB bus registered, assigned bus number 1 Aug 16 00:32:21 kernel: [ 1.741647] ehci_hcd 0000:00:1d.7: using broken periodic workaround Aug 16 00:32:21 kernel: [ 1.761790] ehci_hcd 0000:00:1d.7: cache line size of 32 is not supported Aug 16 00:32:21 kernel: [ 1.761873] ehci_hcd 0000:00:1d.7: irq 23, io mem 0xfdfff000 Aug 16 00:32:21 kernel: [ 1.796043] ehci_hcd 0000:00:1d.7: USB 2.0 started, EHCI 1.00 Aug 16 00:32:21 kernel: [ 1.879069] usb usb1: New USB device found, idVendor=1d6b, idProduct=0002 Aug 16 00:32:21 kernel: [ 1.895446] usb usb1: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 16 00:32:21 kernel: [ 1.911796] usb usb1: Product: EHCI Host Controller Aug 16 
00:32:21 kernel: [ 1.928015] usb usb1: Manufacturer: Linux 2.6.32-5-686 ehci_hcd Aug 16 00:32:21 kernel: [ 1.944331] usb usb1: SerialNumber: 0000:00:1d.7 Aug 16 00:32:21 kernel: [ 1.961285] usb usb1: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 1.994412] hub 1-0:1.0: USB hub found Aug 16 00:32:21 kernel: [ 2.010864] hub 1-0:1.0: 8 ports detected Aug 16 00:32:21 kernel: [ 2.085939] uhci_hcd: USB Universal Host Controller Interface driver Aug 16 00:32:21 kernel: [ 2.191945] uhci_hcd 0000:00:1d.0: PCI INT A -> GSI 23 (level, low) -> IRQ 23 Aug 16 00:32:21 kernel: [ 2.226029] uhci_hcd 0000:00:1d.0: setting latency timer to 64 Aug 16 00:32:21 kernel: [ 2.226034] uhci_hcd 0000:00:1d.0: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.243237] uhci_hcd 0000:00:1d.0: new USB bus registered, assigned bus number 2 Aug 16 00:32:21 kernel: [ 2.260390] uhci_hcd 0000:00:1d.0: irq 23, io base 0x0000fe00 Aug 16 00:32:21 kernel: [ 2.277517] usb usb2: New USB device found, idVendor=1d6b, idProduct=0001 Aug 16 00:32:21 kernel: [ 2.294815] usb usb2: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 16 00:32:21 kernel: [ 2.312173] usb usb2: Product: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.329534] usb usb2: Manufacturer: Linux 2.6.32-5-686 uhci_hcd Aug 16 00:32:21 kernel: [ 2.346828] usb usb2: SerialNumber: 0000:00:1d.0 Aug 16 00:32:21 kernel: [ 2.412989] usb usb2: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 2.430651] usb 1-2: new high speed USB device using ehci_hcd and address 2 Aug 16 00:32:21 kernel: [ 2.449046] hub 2-0:1.0: USB hub found Aug 16 00:32:21 kernel: [ 2.466514] hub 2-0:1.0: 2 ports detected Aug 16 00:32:21 kernel: [ 2.484639] uhci_hcd 0000:00:1d.1: PCI INT B -> GSI 19 (level, low) -> IRQ 19 Aug 16 00:32:21 kernel: [ 2.537750] uhci_hcd 0000:00:1d.1: setting latency timer to 64 Aug 16 00:32:21 kernel: [ 2.537756] uhci_hcd 0000:00:1d.1: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.555085] uhci_hcd 0000:00:1d.1: 
new USB bus registered, assigned bus number 3 Aug 16 00:32:21 kernel: [ 2.572231] uhci_hcd 0000:00:1d.1: irq 19, io base 0x0000fd00 Aug 16 00:32:21 kernel: [ 2.589593] usb usb3: New USB device found, idVendor=1d6b, idProduct=0001 Aug 16 00:32:21 kernel: [ 2.606869] usb usb3: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 16 00:32:21 kernel: [ 2.624134] usb usb3: Product: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.641329] usb usb3: Manufacturer: Linux 2.6.32-5-686 uhci_hcd Aug 16 00:32:21 kernel: [ 2.658505] usb usb3: SerialNumber: 0000:00:1d.1 Aug 16 00:32:21 kernel: [ 2.675843] usb usb3: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 2.692864] hub 3-0:1.0: USB hub found Aug 16 00:32:21 kernel: [ 2.709651] hub 3-0:1.0: 2 ports detected Aug 16 00:32:21 kernel: [ 2.727378] uhci_hcd 0000:00:1d.2: PCI INT C -> GSI 18 (level, low) -> IRQ 18 Aug 16 00:32:21 kernel: [ 2.768252] uhci_hcd 0000:00:1d.2: setting latency timer to 64 Aug 16 00:32:21 kernel: [ 2.768258] uhci_hcd 0000:00:1d.2: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.806679] uhci_hcd 0000:00:1d.2: new USB bus registered, assigned bus number 4 Aug 16 00:32:21 kernel: [ 2.824117] uhci_hcd 0000:00:1d.2: irq 18, io base 0x0000fc00 Aug 16 00:32:21 kernel: [ 2.841405] usb 1-2: New USB device found, idVendor=1058, idProduct=1104 Aug 16 00:32:21 kernel: [ 2.858448] usb 1-2: New USB device strings: Mfr=1, Product=2, SerialNumber=3 Aug 16 00:32:21 kernel: [ 2.875347] usb 1-2: Product: My Book Aug 16 00:32:21 kernel: [ 2.892113] usb 1-2: Manufacturer: Western Digital Aug 16 00:32:21 kernel: [ 2.908915] usb 1-2: SerialNumber: 575532553130303530353538 Aug 16 00:32:21 kernel: [ 2.943242] usb usb4: New USB device found, idVendor=1d6b, idProduct=0001 Aug 16 00:32:21 kernel: [ 2.960405] usb usb4: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 16 00:32:21 kernel: [ 2.977615] usb usb4: Product: UHCI Host Controller Aug 16 00:32:21 kernel: [ 2.994687] usb usb4: Manufacturer: 
Linux 2.6.32-5-686 uhci_hcd Aug 16 00:32:21 kernel: [ 3.011711] usb usb4: SerialNumber: 0000:00:1d.2 Aug 16 00:32:21 kernel: [ 3.029589] usb usb4: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 3.082027] sd 2:0:0:0: [sda] Attached SCSI disk Aug 16 00:32:21 kernel: [ 3.103953] usb 1-2: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 3.122625] hub 4-0:1.0: USB hub found Aug 16 00:32:21 kernel: [ 3.140484] hub 4-0:1.0: 2 ports detected Aug 16 00:32:21 kernel: [ 3.161680] uhci_hcd 0000:00:1d.3: PCI INT D -> GSI 16 (level, low) -> IRQ 16 Aug 16 00:32:21 kernel: [ 3.181257] uhci_hcd 0000:00:1d.3: setting latency timer to 64 Aug 16 00:32:21 kernel: [ 3.181263] uhci_hcd 0000:00:1d.3: UHCI Host Controller Aug 16 00:32:21 kernel: [ 3.198614] uhci_hcd 0000:00:1d.3: new USB bus registered, assigned bus number 5 Aug 16 00:32:21 kernel: [ 3.216012] uhci_hcd 0000:00:1d.3: irq 16, io base 0x0000fb00 Aug 16 00:32:21 kernel: [ 3.249877] Uniform CD-ROM driver Revision: 3.20 Aug 16 00:32:21 kernel: [ 3.267765] usb usb5: New USB device found, idVendor=1d6b, idProduct=0001 Aug 16 00:32:21 kernel: [ 3.284947] usb usb5: New USB device strings: Mfr=3, Product=2, SerialNumber=1 Aug 16 00:32:21 kernel: [ 3.302023] usb usb5: Product: UHCI Host Controller Aug 16 00:32:21 kernel: [ 3.319215] usb usb5: Manufacturer: Linux 2.6.32-5-686 uhci_hcd Aug 16 00:32:21 kernel: [ 3.336298] usb usb5: SerialNumber: 0000:00:1d.3 Aug 16 00:32:21 kernel: [ 3.368377] Initializing USB Mass Storage driver... 
Aug 16 00:32:21 kernel: [ 3.390652] usbcore: registered new interface driver hiddev Aug 16 00:32:21 kernel: [ 3.408109] scsi4 : SCSI emulation for USB Mass Storage devices Aug 16 00:32:21 kernel: [ 3.425281] sr 0:0:1:0: Attached scsi CD-ROM sr0 Aug 16 00:32:21 kernel: [ 3.438978] sr 0:0:1:0: Attached scsi generic sg0 type 5 Aug 16 00:32:21 kernel: [ 3.456328] usbcore: registered new interface driver usb-storage Aug 16 00:32:21 kernel: [ 3.474564] usb-storage: device found at 2 Aug 16 00:32:21 kernel: [ 3.474567] usb-storage: waiting for device to settle before scanning Aug 16 00:32:21 kernel: [ 3.475320] sd 2:0:0:0: Attached scsi generic sg1 type 0 Aug 16 00:32:21 kernel: [ 3.492587] USB Mass Storage support registered. Aug 16 00:32:21 kernel: [ 3.510930] usb usb5: configuration #1 chosen from 1 choice Aug 16 00:32:21 kernel: [ 3.531076] hub 5-0:1.0: USB hub found Aug 16 00:32:21 kernel: [ 3.548399] hub 5-0:1.0: 2 ports detected Aug 16 00:32:21 kernel: [ 3.591743] input: Western Digital My Book as /devices/pci0000:00/0000:00:1d.7/usb1/1-2/1-2:1.1/input/input2 Aug 16 00:32:21 kernel: [ 3.609515] generic-usb 0003:1058:1104.0001: input,hidraw0: USB HID v1.11 Device [Western Digital My Book] on usb-0000:00:1d.7-2/input1 Aug 16 00:32:21 kernel: [ 3.627466] usbcore: registered new interface driver usbhid Aug 16 00:32:21 kernel: [ 8.581664] usb-storage: device scan complete Aug 16 00:32:21 kernel: [ 8.624270] scsi 4:0:0:0: Direct-Access WD My Book 1008 PQ: 0 ANSI: 4 Aug 16 00:32:21 kernel: [ 8.655135] scsi 4:0:0:1: Enclosure WD My Book Device 1008 PQ: 0 ANSI: 4 Aug 16 00:32:21 kernel: [ 8.675393] sd 4:0:0:0: Attached scsi generic sg2 type 0 Aug 16 00:32:21 kernel: [ 8.698669] scsi 4:0:0:1: Attached scsi generic sg3 type 13 Aug 16 00:32:21 kernel: [ 8.723370] sd 4:0:0:0: [sdb] 1953513472 512-byte logical blocks: (1.00 TB/931 GiB) Aug 16 00:32:21 kernel: [ 8.750477] sd 4:0:0:0: [sdb] Write Protect is off Aug 16 00:32:21 kernel: [ 8.769411] sd 4:0:0:0: [sdb] Mode Sense: 10 
00 00 00 Aug 16 00:32:21 kernel: [ 8.769414] sd 4:0:0:0: [sdb] Assuming drive cache: write through Aug 16 00:32:21 kernel: [ 8.822971] sd 4:0:0:0: [sdb] Assuming drive cache: write through Aug 16 00:32:21 kernel: [ 8.841978] sdb: sdb1 Aug 16 00:32:21 kernel: [ 8.905580] sd 4:0:0:0: [sdb] Assuming drive cache: write through Aug 16 00:32:21 kernel: [ 8.924173] sd 4:0:0:0: [sdb] Attached SCSI disk Aug 16 00:32:21 kernel: [ 11.600492] XFS mounting filesystem sdb1 Aug 16 00:32:21 kernel: [ 12.222948] Ending clean XFS mount for filesystem: sdb1 After a while the following appears in a log: Aug 16 09:30:56 kernel: [32359.112029] usb 1-2: reset high speed USB device using ehci_hcd and address 2 Aug 16 09:31:59 kernel: [32422.112035] usb 1-2: reset high speed USB device using ehci_hcd and address 2 Aug 16 09:33:00 kernel: [32483.112029] usb 1-2: reset high speed USB device using ehci_hcd and address 2 And then it is followed by few kernel dumps, which I think, are not good: Aug 16 09:33:40 kernel: [32520.428027] INFO: task xfssyncd:1002 blocked for more than 120 seconds. Aug 16 09:33:40 kernel: [32520.462689] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 16 09:33:40 kernel: [32520.497422] xfssyncd D c3d84a60 0 1002 2 0x00000000 Aug 16 09:33:40 kernel: [32520.532117] f6c9aa80 00000046 c1132742 c3d84a60 00000286 c1418100 c1418100 00000000 Aug 16 09:33:40 kernel: [32520.566867] f6c9ac3c c2808100 00000000 f653b18b 00001d76 00000001 f6c9aa80 c3c3f0e0 Aug 16 09:33:40 kernel: [32520.601343] 08e59242 f6c9ac3c 2e41392b 00000000 08e59242 00000000 c3f7fb48 0067385a Aug 16 09:33:40 kernel: [32520.635533] Call Trace: Aug 16 09:33:40 kernel: [32520.668991] [<c1132742>] ? cfq_set_request+0x0/0x290 Aug 16 09:33:40 kernel: [32520.702804] [<c126b532>] ? io_schedule+0x5f/0x98 Aug 16 09:33:40 kernel: [32520.736555] [<c1128be0>] ? get_request_wait+0xcb/0x146 Aug 16 09:33:40 kernel: [32520.770360] [<c10437ba>] ? 
autoremove_wake_function+0x0/0x2d Aug 16 09:33:40 kernel: [32520.804110] [<c112907c>] ? __make_request+0x2cc/0x3d9 Aug 16 09:33:40 kernel: [32520.837713] [<c1128230>] ? blk_peek_request+0x135/0x143 Aug 16 09:33:40 kernel: [32520.871265] [<f8582987>] ? scsi_dispatch_cmd+0x185/0x1e5 [scsi_mod] Aug 16 09:33:40 kernel: [32520.904407] [<c1127cf1>] ? generic_make_request+0x266/0x2b4 Aug 16 09:33:40 kernel: [32520.937007] [<c10cf821>] ? bvec_alloc_bs+0x95/0xaf Aug 16 09:33:40 kernel: [32520.969033] [<c1127dfb>] ? submit_bio+0xbc/0xd6 Aug 16 09:33:40 kernel: [32521.000485] [<c10cffd1>] ? bio_add_page+0x28/0x2e Aug 16 09:33:40 kernel: [32521.031403] [<f8918d38>] ? _xfs_buf_ioapply+0x206/0x22b [xfs] Aug 16 09:33:40 kernel: [32521.061888] [<f89197bd>] ? xfs_buf_iorequest+0x38/0x60 [xfs] Aug 16 09:33:40 kernel: [32521.091845] [<f8907230>] ? xlog_bdstrat_cb+0x16/0x3d [xfs] Aug 16 09:33:40 kernel: [32521.121222] [<f8905781>] ? XFS_bwrite+0x32/0x64 [xfs] Aug 16 09:33:40 kernel: [32521.150007] [<f89059be>] ? xlog_sync+0x20b/0x311 [xfs] Aug 16 09:33:40 kernel: [32521.178214] [<f89112fc>] ? xfs_trans_ail_tail+0x12/0x27 [xfs] Aug 16 09:33:40 kernel: [32521.205914] [<f8906261>] ? xlog_state_sync_all+0xa2/0x141 [xfs] Aug 16 09:33:40 kernel: [32521.233074] [<f8906611>] ? _xfs_log_force+0x51/0x68 [xfs] Aug 16 09:33:40 kernel: [32521.259664] [<c103abaf>] ? process_timeout+0x0/0x5 Aug 16 09:33:40 kernel: [32521.285662] [<f8906636>] ? xfs_log_force+0xe/0x27 [xfs] Aug 16 09:33:40 kernel: [32521.311171] [<f89202df>] ? xfs_sync_worker+0x17/0x5c [xfs] Aug 16 09:33:40 kernel: [32521.336117] [<f891fbb7>] ? xfssyncd+0x134/0x17d [xfs] Aug 16 09:33:40 kernel: [32521.360498] [<f891fa83>] ? xfssyncd+0x0/0x17d [xfs] Aug 16 09:33:40 kernel: [32521.384211] [<c1043588>] ? kthread+0x61/0x66 Aug 16 09:33:40 kernel: [32521.407890] [<c1043527>] ? kthread+0x0/0x66 Aug 16 09:33:40 kernel: [32521.430876] [<c1003d47>] ? 
kernel_thread_helper+0x7/0x10 Aug 16 09:33:40 kernel: [32521.453394] INFO: task flush-8:16:12945 blocked for more than 120 seconds. Aug 16 09:33:40 kernel: [32521.476116] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Aug 16 09:33:40 kernel: [32521.498579] flush-8:16 D 00000000 0 12945 2 0x00000000 Aug 16 09:33:40 kernel: [32521.520649] f4e4d540 00000046 e412e940 00000000 00000002 c1418100 c1418100 c14136ac Aug 16 09:33:40 kernel: [32521.542426] f4e4d6fc c2808100 00000000 00000000 000008b4 00000001 f4e4d540 c3c3f0e0 Aug 16 09:33:40 kernel: [32521.563745] 02e905a8 f4e4d6fc 007a5399 00000000 02e905a8 00000000 f4e2db48 00670b98 Aug 16 09:33:40 kernel: [32521.585077] Call Trace: Aug 16 09:33:40 kernel: [32521.605790] [<c126b532>] ? io_schedule+0x5f/0x98 Aug 16 09:33:40 kernel: [32521.626184] [<c1128be0>] ? get_request_wait+0xcb/0x146 Aug 16 09:33:40 kernel: [32521.646133] [<c10437ba>] ? autoremove_wake_function+0x0/0x2d Aug 16 09:33:40 kernel: [32521.665659] [<c112907c>] ? __make_request+0x2cc/0x3d9 Aug 16 09:33:40 kernel: [32521.684716] [<f891796e>] ? xfs_convert_page+0x30a/0x331 [xfs] Aug 16 09:33:40 kernel: [32521.703366] [<c1127cf1>] ? generic_make_request+0x266/0x2b4 Aug 16 09:33:40 kernel: [32521.721644] [<c10cf821>] ? bvec_alloc_bs+0x95/0xaf Aug 16 09:33:40 kernel: [32521.739465] [<c1127dfb>] ? submit_bio+0xbc/0xd6 Aug 16 09:33:40 kernel: [32521.756896] [<c10cfa45>] ? bio_alloc_bioset+0x7b/0xba Aug 16 09:33:40 kernel: [32521.774046] [<f8917af0>] ? xfs_submit_ioend_bio+0x3b/0x44 [xfs] Aug 16 09:33:40 kernel: [32521.790694] [<f8917ba3>] ? xfs_submit_ioend+0xaa/0xc4 [xfs] Aug 16 09:33:40 kernel: [32521.806736] [<f891817d>] ? xfs_page_state_convert+0x5c0/0x61c [xfs] Aug 16 09:33:40 kernel: [32521.822859] [<c113705b>] ? __lookup_tag+0x8e/0xee Aug 16 09:33:40 kernel: [32521.838958] [<f891840d>] ? xfs_vm_writepage+0x91/0xc4 [xfs] Aug 16 09:33:40 kernel: [32521.855039] [<c108bbcc>] ? 
__writepage+0x8/0x22 Aug 16 09:33:40 kernel: [32521.871067] [<c108c17b>] ? write_cache_pages+0x1af/0x29f Aug 16 09:33:40 kernel: [32521.886616] [<c108bbc4>] ? __writepage+0x0/0x22 Aug 16 09:33:40 kernel: [32521.901593] [<c108c285>] ? generic_writepages+0x1a/0x21 Aug 16 09:33:40 kernel: [32521.916455] [<f8918338>] ? xfs_vm_writepages+0x0/0x38 [xfs] Aug 16 09:33:40 kernel: [32521.931484] [<c108c2a5>] ? do_writepages+0x19/0x25 Aug 16 09:33:40 kernel: [32521.946648] [<c10c80d9>] ? writeback_single_inode+0xc7/0x273 Aug 16 09:33:40 kernel: [32521.961675] [<c10c8c44>] ? writeback_inodes_wb+0x3dd/0x49c Aug 16 09:33:40 kernel: [32521.976831] [<c10c8e18>] ? wb_writeback+0x115/0x178 Aug 16 09:33:40 kernel: [32521.991778] [<c10c901f>] ? wb_do_writeback+0x121/0x131 Aug 16 09:33:40 kernel: [32522.006538] [<c103abaf>] ? process_timeout+0x0/0x5 Aug 16 09:33:40 kernel: [32522.021091] [<c10c9050>] ? bdi_writeback_task+0x21/0x89 Aug 16 09:33:40 kernel: [32522.035493] [<c10979e5>] ? bdi_start_fn+0x59/0xa4 Aug 16 09:33:40 kernel: [32522.049765] [<c109798c>] ? bdi_start_fn+0x0/0xa4 Aug 16 09:33:40 kernel: [32522.063792] [<c1043588>] ? kthread+0x61/0x66 Aug 16 09:33:40 kernel: [32522.077612] [<c1043527>] ? kthread+0x0/0x66 Aug 16 09:33:40 kernel: [32522.091260] [<c1003d47>] ? kernel_thread_helper+0x7/0x10 Aug 16 09:33:40 kernel: [32522.104966] INFO: task smartctl:13098 blocked for more than 120 seconds. Aug 16 09:33:40 kernel: [32522.118883] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Aug 16 09:33:40 kernel: [32522.133012] smartctl D 00000020 0 13098 13097 0x00000000 Aug 16 09:33:40 kernel: [32522.147221] e50b9540 00000086 c11d28a8 00000020 00000770 c1418100 c1418100 c14136ac Aug 16 09:33:40 kernel: [32522.161720] e50b96fc c2808100 00000000 e53e8800 00000000 00000020 c3cec000 c13886c0 Aug 16 09:33:40 kernel: [32522.176217] f99dab68 e50b96fc 007a4f1e 00000001 c4082f24 c4082ed8 00000001 c3c3f0e0 Aug 16 09:33:40 kernel: [32522.190737] Call Trace: Aug 16 09:33:40 kernel: [32522.205038] [<c11d28a8>] ? __netdev_alloc_skb+0x14/0x2d Aug 16 09:33:40 kernel: [32522.219605] [<c126b799>] ? schedule_timeout+0x20/0xb0 Aug 16 09:33:40 kernel: [32522.234144] [<c112820d>] ? blk_peek_request+0x112/0x143 Aug 16 09:33:40 kernel: [32522.248649] [<f85873b6>] ? scsi_request_fn+0x3c1/0x47a [scsi_mod] Aug 16 09:33:40 kernel: [32522.263233] [<c103aba8>] ? del_timer+0x55/0x5c Aug 16 09:33:40 kernel: [32522.277773] [<c126b6a2>] ? wait_for_common+0xa4/0x100 Aug 16 09:33:40 kernel: [32522.292342] [<c102cd8d>] ? default_wake_function+0x0/0x8 Aug 16 09:33:40 kernel: [32522.306958] [<c112b3d1>] ? blk_execute_rq+0x8b/0xb2 Aug 16 09:33:40 kernel: [32522.321569] [<c112b2ac>] ? blk_end_sync_rq+0x0/0x23 Aug 16 09:33:40 kernel: [32522.336070] [<c112b58b>] ? blk_recount_segments+0x13/0x20 Aug 16 09:33:40 kernel: [32522.350583] [<c1127307>] ? blk_rq_bio_prep+0x44/0x74 Aug 16 09:33:40 kernel: [32522.365059] [<c112b0b2>] ? blk_rq_map_kern+0xc5/0xee Aug 16 09:33:40 kernel: [32522.379439] [<c112e2a5>] ? sg_scsi_ioctl+0x221/0x2aa Aug 16 09:33:40 kernel: [32522.393801] [<c112e672>] ? scsi_cmd_ioctl+0x344/0x39a Aug 16 09:33:40 kernel: [32522.408140] [<c1024c87>] ? update_curr+0x106/0x1b3 Aug 16 09:33:40 kernel: [32522.422566] [<c1024c87>] ? update_curr+0x106/0x1b3 Aug 16 09:33:40 kernel: [32522.436832] [<f87676aa>] ? sd_ioctl+0x90/0xb5 [sd_mod] Aug 16 09:33:40 kernel: [32522.451228] [<c112c35f>] ? __blkdev_driver_ioctl+0x53/0x63 Aug 16 09:33:40 kernel: [32522.465689] [<c112cbbf>] ? 
blkdev_ioctl+0x850/0x891 Aug 16 09:33:40 kernel: [32522.479982] [<c1020474>] ? __wake_up_common+0x34/0x59 Aug 16 09:33:40 kernel: [32522.494138] [<c10244cd>] ? complete+0x28/0x36 Aug 16 09:33:40 kernel: [32522.507986] [<c1086c64>] ? find_get_page+0x1f/0x81 Aug 16 09:33:40 kernel: [32522.521671] [<c10abed5>] ? add_partial+0xe/0x40 Aug 16 09:33:40 kernel: [32522.535285] [<c1086e68>] ? lock_page+0x8/0x1d Aug 16 09:33:40 kernel: [32522.548797] [<c1087432>] ? filemap_fault+0xb5/0x2e6 Aug 16 09:33:40 kernel: [32522.562141] [<c109941c>] ? __do_fault+0x381/0x3b1 Aug 16 09:33:40 kernel: [32522.575441] [<c10d0c30>] ? block_ioctl+0x27/0x2c Aug 16 09:33:40 kernel: [32522.588708] [<c10d0c09>] ? block_ioctl+0x0/0x2c Aug 16 09:33:40 kernel: [32522.601858] [<c10bcd78>] ? vfs_ioctl+0x1c/0x5f Aug 16 09:33:40 kernel: [32522.614917] [<c10bd30c>] ? do_vfs_ioctl+0x4aa/0x4e5 Aug 16 09:33:40 kernel: [32522.627961] [<c10350db>] ? __do_softirq+0x115/0x151 Aug 16 09:33:40 kernel: [32522.640901] [<c126e270>] ? do_page_fault+0x2f1/0x307 Aug 16 09:33:40 kernel: [32522.653803] [<c10bd388>] ? sys_ioctl+0x41/0x58 Aug 16 09:33:40 kernel: [32522.666674] [<c10030fb>] ? sysenter_do_call+0x12/0x28
    After that, more "reset high speed USB device using ehci_hcd and address 2" messages follow. I have read similar error reports in various places and tried the following:
    - Upgraded the kernel from v2.6.26-2 to 2.6.32-5, which did not solve the problem.
    - Some say it might be a cable problem, so I replaced the USB-to-miniUSB cable (which connects the external drive to the computer) with another one. No change.
    - Some suggest trying another USB port. I have only 4 external USB ports; I tried another one, with no success.
    - Some say to try uhci_hcd. I unmounted the device, unloaded ehci_hcd and mounted it again. The only difference is that the log now shows "reset full speed USB device using uhci_hcd and address 2", followed by similar kernel dumps after a while.
    - Some say to echo 128 > /sys/block/sdb/device/max_sectors. I tried it with ehci_hcd, with no success (note: I issued this command after the drive was mounted but before using it actively).
    - I launched smartmond and periodically check the output of smartctl: the drive temperature is OK, and the number of bad sectors and uncorrectable errors is 0.
    Nothing suspicious is reported by S.M.A.R.T., except maybe the following:
    Aug 16 12:40:12 kernel: [43715.314566] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
    Aug 16 12:40:13 kernel: [43715.705622] program smartctl is using a deprecated SCSI ioctl, please convert it to SG_IO
    Of course, I have not tried every combination of the above, but unfortunately I have run out of substantial ideas. If anybody can advise something specific about the problem, it would be very welcome.
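The log in this post shows a recognisable sequence: repeated USB resets on one port, then tasks blocked for more than 120 seconds. To compare runs (different cable, port or host controller driver) it may help to tally both kinds of message. A minimal sketch, assuming the message wording shown in the post's 2.6.32 kernel log; the function name summarise is made up for illustration:

```python
import re
from collections import Counter

# Patterns follow the wording of the kernel messages quoted in the post
RESET = re.compile(r"usb (\S+): reset (?:high|full) speed USB device")
HUNG = re.compile(r"INFO: task (\S+):\d+ blocked for more than \d+ seconds")

def summarise(log_lines):
    """Count USB resets per port and hung-task reports per task name."""
    resets, hung = Counter(), Counter()
    for line in log_lines:
        m = RESET.search(line)
        if m:
            resets[m.group(1)] += 1
            continue
        m = HUNG.search(line)
        if m:
            hung[m.group(1)] += 1
    return resets, hung

# Lines copied from the log in the post
sample = [
    "Aug 16 09:30:56 kernel: [32359.112029] usb 1-2: reset high speed USB device using ehci_hcd and address 2",
    "Aug 16 09:31:59 kernel: [32422.112035] usb 1-2: reset high speed USB device using ehci_hcd and address 2",
    "Aug 16 09:33:00 kernel: [32483.112029] usb 1-2: reset high speed USB device using ehci_hcd and address 2",
    "Aug 16 09:33:40 kernel: [32520.428027] INFO: task xfssyncd:1002 blocked for more than 120 seconds.",
    "Aug 16 09:33:40 kernel: [32521.453394] INFO: task flush-8:16:12945 blocked for more than 120 seconds.",
    "Aug 16 09:33:40 kernel: [32522.104966] INFO: task smartctl:13098 blocked for more than 120 seconds.",
]
resets, hung = summarise(sample)
print(dict(resets), dict(hung))
```

The tally only describes what the log already says (all resets are on port 1-2, and the blocked tasks are all waiting on that disk); it does not tell the drive, the enclosure and the host controller apart.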


  • How to find cause of main file system going to read only mode

    - by user606521
    Ubuntu 12.04: the file system frequently goes into read-only mode. I have already read the question "file system is going into read only mode frequently", but I need to know whether this could be caused by something other than a dying hard drive. The server is provided by my client; I just run some node.js workers plus one node.js server on it, and I use mongodb. From time to time (every 20-50 h) the system suddenly makes the filesystem read-only, the mongodb process fails (because of the read-only fs) and my node workers/server (which are started by forever) are simply killed. Here is the log from dmesg. I can see some errors in it, messages that the FS is going read-only, and also a JOURNAL error, but I would like to find the cause of those errors: http://speedy.sh/Ux2VV/dmesg.log.txt
    Edit: output of smartctl -t long /dev/sda:
    smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.5.0-23-generic] (local build) Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net SMART support is: Unavailable - device lacks SMART capability. A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.
    What am I doing wrong? The same happens for sda2. Moreover, whenever I now type a command that does not exist in the shell, I get this: Sorry, command-not-found has crashed! Please file a bug report at: https://bugs.launchpad.net/command-not-found/+filebug Please include the following information with the report:
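Since the remount happens only every 20-50 hours, it can help to record the moment it happens so it can be matched against dmesg. A minimal sketch of detecting read-only mounts from text in the /proc/mounts format; the sample entries below are invented for illustration, and on a real system the text would come from reading /proc/mounts:

```python
def read_only_mounts(mounts_text):
    """Return (device, mountpoint) for every entry whose option list contains 'ro'."""
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        device, mountpoint, _fstype, options = fields[:4]
        # Compare whole options so that 'errors=remount-ro' is not mistaken for 'ro'
        if "ro" in options.split(","):
            hits.append((device, mountpoint))
    return hits

# Invented sample in /proc/mounts format: the root fs has been remounted read-only
sample = (
    "/dev/sda1 / ext4 ro,relatime,errors=remount-ro 0 0\n"
    "/dev/sda2 /home ext4 rw,relatime,errors=remount-ro 0 0\n"
)
print(read_only_mounts(sample))
```

Run from cron or a small loop, a check like this would give a timestamp close to the moment of the remount. It does not explain the cause, which is still in the kernel log.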


  • Hard Disk Not Counting Reallocated Sectors

    - by MetaNova
    I have a drive that reports a Current_Pending_Sector count of 45. I used badblocks to identify the sectors and have been trying to write zeros to them with dd. As I understand it, writing data directly to a bad sector should trigger a reallocation, reducing the pending count by one and increasing the reallocated sector count. On this disk, however, the raw values of both Reallocated_Sector_Ct and Reallocated_Event_Count are 0, and dd fails with I/O errors when I try to write zeros to the bad sectors. dd works fine when I write to a good sector.
    # dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=217152
    dd: error writing ‘/dev/sdb’: Input/output error
    Does this mean that my drive, in some way, has no spare sectors left for reallocation? Is my drive just in general a terrible person? (The drive isn't actually mine, I'm helping a friend out. They might have just gotten a cheap drive or something.) In case it is relevant, here is the output of smartctl -i:
    Model Family: Western Digital Caviar Green (AF)
    Device Model: WDC WD15EARS-00Z5B1
    Serial Number: WD-WMAVU3027748
    LU WWN Device Id: 5 0014ee 25998d213
    Firmware Version: 80.00A80
    User Capacity: 1,500,301,910,016 bytes [1.50 TB]
    Sector Size: 512 bytes logical/physical
    Device is: In smartctl database [for details use: -P show]
    ATA Version is: ATA8-ACS (minor revision not indicated)
    SATA Version is: SATA 2.6, 3.0 Gb/s
    Local Time is: Fri Oct 18 17:47:29 2013 CDT
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    UPDATE: I have run shred on the disk, which caused Current_Pending_Sector to drop to zero. However, Reallocated_Sector_Ct and Reallocated_Event_Count are still zero, and dd can now write data to the sectors it previously could not. This leaves me with several other questions:
    Why aren't the reallocations being recorded by the disk? I'm assuming the reallocation took place, as I can now write data directly to the sector and couldn't before.
    Why did shred cause reallocation and not dd? Does the fact that shred writes random data instead of just zeros make a difference?
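One possible explanation, offered as an assumption and not something the post confirms: smartctl lists the model family as "(AF)", i.e. Advanced Format, so the drive may use 4096-byte physical sectors internally even though it reports 512 bytes. A 512-byte write would then force the drive to read the rest of the physical sector first, which fails if that sector is unreadable, while a write covering the whole physical sector would not need the read. The sketch below only does the arithmetic for such an aligned write; the 4096-byte figure, the helper name aligned_write and the /dev/sdX placeholder are all assumptions.

```python
LOGICAL = 512    # logical sector size, as reported by smartctl -i in the post
PHYSICAL = 4096  # assumption: physical sector size of an Advanced Format drive

def aligned_write(lba, logical=LOGICAL, physical=PHYSICAL):
    """Return (bs, seek, count) for a dd write that covers the whole
    physical sector containing the given logical block address."""
    per_physical = physical // logical                 # logical sectors per physical one
    first_lba = (lba // per_physical) * per_physical   # round down to a boundary
    seek = first_lba * logical // physical             # offset in units of `physical`
    return physical, seek, 1

bs, seek, count = aligned_write(217152)  # the sector from the failing dd command
# Only prints the command; /dev/sdX is a placeholder, nothing is written here.
print(f"dd if=/dev/zero of=/dev/sdX bs={bs} count={count} seek={seek} oflag=direct")
```

If this assumption holds, it would also fit the UPDATE: shred writes in large blocks, so its writes cover whole physical sectors regardless of whether the data is random or zero. Reallocation counters staying at 0 would then simply mean the sectors were rewritten in place.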


  • Strange File-Server I/O Spikes - What Is Causing This?

    - by CruftRemover
    I am currently having a problem with a small Linux server that is providing file-sharing services to four Windows 7 32-bit clients. The server is an AMD PhenomX3 with two Western Digital 10EADS (1TB) drives, attached to a Gigabyte GA-MA770T-UD3 mainboard and running Ubuntu Server 10.04.1 LTS. The client machines are taking an extremely long time to access/transfer data on the file server. Applications often become non-responsive while trying to open files located remotely, or one program attempting to open a file but having to wait will prevent other software from accessing network resources at all. Other examples include one image taking 20 seconds or more to open, and in one instance a user waited 110 seconds for Microsoft Word 2007 to save a document. I had initially thought the problem was network-related, but this appears not to be the case. All cables and switches have been tested (one cable was replaced) for verification. This was additionally confirmed when closing down all client machines and rebooting the server resulted in the hard-drive light staying on solid during the startup process. For the first 15 minutes during boot, logon and after logging on (with no client machines attached), the system displayed a load average of 4 or higher. Symptoms included waiting several minutes for the logon prompt to appear, and then several minutes for the password prompt to appear after typing in a user name. After logon, it also took upwards of 45 seconds for the 'smartctl' man page to appear after the command 'man smartctl' was issued. After 15 minutes of this behaviour, the load average dropped to around 0.02 and the machine behaved normally. I have also considered that the problem is hard-drive-related, however diagnostic programs reveal no drive problems. Western Digital DLG, Spinrite and SMARTUDM show no abnormal characteristics - the drives are in perfect health as far as the hardware is concerned. 
I have thus far been completely unable to track down the cause of this problem, so any help is greatly appreciated. Requested Information: Output of 'free' hxxp://pastebin.com/mfsJS8HS (stupid spam filter) The command 'hdparm -d /dev/sda1' reports: HDIO_GET_DMA failed: Inappropriate ioctl for device (the BIOS is set to AHCI - I probably should have mentioned that).
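A load average of 4 with no clients attached suggests processes waiting on I/O, so one thing worth measuring during those 15 minutes is how busy each disk actually is. A minimal sketch, assuming the standard /proc/diskstats layout in which the 13th column is the number of milliseconds the device spent doing I/O; the two snapshots below are invented numbers, and the function names are made up for illustration:

```python
def io_busy_ms(diskstats_text):
    """Map device name -> milliseconds spent doing I/O (13th column of /proc/diskstats)."""
    busy = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) >= 13:
            busy[fields[2]] = int(fields[12])
    return busy

def utilisation(before, after, interval_s):
    """Percentage of the interval each device was busy, from two io_busy_ms() snapshots."""
    return {
        dev: 100.0 * (after[dev] - before[dev]) / (interval_s * 1000.0)
        for dev in after
        if dev in before
    }

# Invented snapshots taken 10 seconds apart (real ones would be read from /proc/diskstats)
t0 = "   8       0 sda 1000 0 8000 500 200 0 1600 300 0 700 800\n"
t1 = "   8       0 sda 1900 0 15200 9300 260 0 2080 900 3 10200 10200\n"
print(utilisation(io_busy_ms(t0), io_busy_ms(t1), 10))
```

A disk that stays near 100% while throughput is low would point at the drive or controller path rather than the network, but that is an interpretation to check against the real numbers.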


  • WD1000FYPS harddrive is marked 0 mb in 3ware (and no SMART)

    - by osgx
    After reboot my SATA 1TB WD1000FYPS (previously is was "Drive error") is marked 0 mb in 3ware web gui. Complete message: Available Drives (Controller ID 0) Port 1 WDC WD1000FYPS-01ZKB0 0.00 MB NOT SUPPORTED [Remove Drive] SMART gives me only Device Model and ATA protocol version 1 (not 7-8 as it must be for SATA) What does it mean? Just before reboot, when is was marked only with "Device Error", smart was: Device Model: WDC WD1000FYPS-01ZKB0 Serial Number: WD-WCASJ1130*** Firmware Version: 02.01B01 User Capacity: 1,000,204,886,016 bytes Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 8 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Sun Mar 7 18:47:35 2010 MSK SMART support is: Available - device has SMART capability. SMART support is: Enabled SMART overall-health self-assessment test result: PASSED SMART Attributes Data Structure revision number: 16 Vendor Specific SMART Attributes with Thresholds: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0 3 Spin_Up_Time 0x0003 188 186 021 Pre-fail Always - 7591 4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 229 5 Reallocated_Sector_Ct 0x0033 199 199 140 Pre-fail Always - 3 7 Seek_Error_Rate 0x000e 193 193 000 Old_age Always - 125 9 Power_On_Hours 0x0032 078 078 000 Old_age Always - 16615 10 Spin_Retry_Count 0x0012 100 100 000 Old_age Always - 0 11 Calibration_Retry_Count 0x0012 100 253 000 Old_age Always - 0 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 77 192 Power-Off_Retract_Count 0x0032 198 198 000 Old_age Always - 1564 193 Load_Cycle_Count 0x0032 146 146 000 Old_age Always - 164824 194 Temperature_Celsius 0x0022 117 100 000 Old_age Always - 35 196 Reallocated_Event_Count 0x0032 199 199 000 Old_age Always - 1 197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0 198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0 
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0 200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0 What can be wrong with it? Can it be restored? PS: The new SMART output is === START OF INFORMATION SECTION === Device Model: WDC WD1000FYPS-01ZKB0 Serial Number: [No Information Found] Firmware Version: [No Information Found] Device is: Not in smartctl database [for details use: -P showall] ATA Version is: 1 ATA Standard is: Exact ATA specification draft version not indicated Local Time is: Mon Mar 8 00:29:44 2010 MSK SMART is only available in ATA Version 3 Revision 3 or greater. We will try to proceed in spite of this. SMART support is: Ambiguous - ATA IDENTIFY DEVICE words 82-83 don't show if SMART supported. Checking for SMART support by trying SMART ENABLE command. Command failed, ata.status=(0x00), ata.command=(0x51), ata.flags=(0x01) Error SMART Enable failed: Input/output error SMART ENABLE failed - this establishes that this device lacks SMART functionality. A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options. PPS: There was rapid growth of "192 Power-Off_Retract_Count" before the drive died. The drive was used in a RAID, with several drives from the same factory packaging box (close IDs). The drives were mounted identically. "Rapid" means almost linear growth from 300 to 1700 in 6-7 hours. The maximum temperature was 41C (thanks to munin's SMART monitoring).
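
To put the "rapid growth" in the PPS into numbers, here is a minimal sketch in Python. It assumes the counter went from about 300 to about 1700 over 6 to 7 hours, as stated in the post; the helper name `growth_per_hour` is hypothetical and not part of any tool mentioned here.

```python
def growth_per_hour(start: int, end: int, hours: float) -> float:
    """Average increase of a SMART raw counter per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (end - start) / hours


# Figures quoted in the post: roughly 300 -> 1700 over 6 to 7 hours.
fast = growth_per_hour(300, 1700, 6)  # upper bound on the rate (shorter window)
slow = growth_per_hour(300, 1700, 7)  # lower bound on the rate (longer window)
print(f"{slow:.0f} to {fast:.0f} retracts per hour")
```

That works out to a few retract events every minute, which is far from what the same counter would show on a drive that is only powered off occasionally.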

    Read the article

  • Bad Sectors on Hard Drive

    - by RHPT
    I run check disk pretty regularly on my hard drive, and lately it's been saying that I have some bad sectors (66, to be exact). I've run smartctl and HD Tune. Both tell me that I have bad sectors and that the drive is in the "pre-fail" stage. The drive is only a couple of years old. How worried should I be? My drive is a FUJITSU MHW2160BJ FFS G2

    Read the article

  • How can one associate a 3ware controller with the corresponding /dev/tw?? device?

    - by barbaz
    I have a few 3ware RAID controllers installed in a system. Is there any way to figure out the mapping between the following identifiers, each describing in a way the very same RAID controller? The tw_cli reported controller id (e.g. c0,c1,c2,...) The corresponding device nodes that allow smartctl access via the 3ware driver (e.g. /dev/twa0, /dev/twa1, /dev/twl0) The block device presented to the system representing a RAID unit (/dev/sda, /dev/sdb,...)

    Read the article

  • SSD becomes hot, disk failure warning

    - by Aegluin
    I have a two-week-old SSD (Kingston SSDnow 64GB). Yesterday, the computer shut down twice, and after rebooting I was bombarded with disk failure warnings. I usually take such warnings seriously (and I backed up), but I am skeptical. After cooling down, the laptop boots again, and the only red SMART value was the temperature (Ubuntu did not show the temperature at the time of failure, only the then-current 29°). After refreshing the SMART status and doing a "self test", everything is green. Before contacting Kingston support, I would like to know whether it could be due to a software issue: Is it possible that it is a false alarm, and how can I check? I installed Ubuntu 12.04 32bit and took care of alignment. I assumed Ubuntu was set up with optimal settings for SSDs; how can I check that there was no mistake? The current temperature is around 40-56°. Is such a temperature abnormal for SSDs? Output of sudo smartctl --all /dev/sda: http://pastebin.ubuntu.com/1175940/

    Read the article

  • Unneeded RAID recovery

    - by Shinhan
    I have two software RAID 5s. Today when I turned on my server (after it was off for two weeks while I was on vacation), I got the message that the software RAID is degraded and was offered a RAID recovery console or a degraded boot. I have no idea what to do in that recovery console, by the way, so I just exited it. After the boot finished normally I took a peek at /proc/mdstat, but there is nothing amiss there. So I took a look at mdadm --detail /dev/md0 (and md1), and again everything looks fine, with Failed Devices being 0 both times. Next I took a look at smartctl --all /dev/sdX for all of the drives, and Reallocated_Sector_Ct is 0. (Lots of other zeros; all numbers look fine to me.) Does anybody have any idea why I'm getting the RAID recovery message on boot when nothing seems bad?

    Read the article

  • RAID controller dropping the wrong drive

    - by bramp
    I've been having an issue with a 3ware 9500S-8 RAID 10, and I have contacted their tech support, but I wanted to hear the serverfault community's recommendations. Firstly, all my data is backed up and secure, so I don't mind blowing my RAID away if I have to. But let me describe the problem I've been seeing. A month ago, disk 6 dropped out of the RAID. It is mirrored with disk 7, so I wasn't that bothered. I went to the data centre and replaced it. When I got back to the office, I noticed that disk 6 was still not in the RAID, and in fact the controller was still showing the name of the old drive. A week later I went back and replaced the drive again, thinking I might have swapped in a bad drive. Still the same problem. I decided to reboot the machine, to see if that would "force" the controller into seeing the new drive. It did, and a rebuild started (from disk 7). Eventually both drives were showing as good. A week later, MySQL flagged the database as corrupt and was unable to repair it. I don't know what has gone wrong, but I suspected this 6-7 pair. At this point I noticed that the RAID had constantly been verifying itself, over and over. Regardless of this I began to rebuild the database, which took about 19 hours. It's a big database. Near the end of the repair, the RAID controller told me it had dropped disk 7, and that some data was most likely corrupted. I contacted LSI tech support, and they very promptly started to help me. I mentioned that drive 7 had been dropped. They suspect that drive 7 was always at fault, and drive 6 had always been good. I want to know how often a RAID controller would drop the wrong drive (in this case dropping drive 6 a month ago, instead of 7). I foolishly didn't run smartctl on the drives before I started swapping them out. I just assumed the RAID controller knew what it was talking about. 
I think my plan of action is to replace drive 7, rebuild the array from scratch, double check smartctl on ALL the disks, and then start restoring my data again. I would appreciate anyone's input on what the correct procedure for swapping drives is, and how often failures like this happen. If anyone would like more information then I'd be happy to provide it. Thanks in advance. Oh, some more information: I'm running CentOS 5.3, with two RAID arrays, a simple RAID 1 for the OS, and RAID 10 for the database. Both arrays are on different controllers. The RAID 10 was made of 10 identical ST3640323AS drives, until I swapped in a SAMSUNG HD103SJ last month.

    Read the article

  • Linux software raid robustness

    - by Waxhead
    I have a 4 disk 5TB raid5 setup where a disk is showing signs of going down the drain. It is reporting media errors, and from dmesg I can see that several read errors are corrected. smartctl does report "notifications" but no panic so far. Since new disks are rather expensive at the moment, I am starting to ponder exactly how robust the linux md layer is. I would appreciate it if someone could shed some light on how md actually deals with disk errors. For example, how does md deal with write and read errors, and what does it (really) take for a disk to be rejected from an array? I also read that md recently got support for mapping out bad blocks. Does this mean that the read errors I've had would have been mapped out if I were running kernel 3.1, or would md still try to "work on them" to make them usable?

    Read the article

  • GNU/Linux: SAS-disk detected as /dev/sg7 - not as /dev/sdb

    - by Ole Tange
    I have just installed a SAS disk into a Debian server. It was detected correctly and everything was fine. Then I moved the SAS disk to a different Debian server, the same hardware model and running the same version of Debian, but here the SAS disk is detected as /dev/sg7 and not /dev/sdb. smartctl -a /dev/sg7 works fine, but fdisk and cat hang. I tried putting the SAS disk in another slot: same problem. How can I force the SAS disk to be detected as /dev/sdb? # uname -a Linux maxwell 3.2.0-4-amd64 #1 SMP Debian 3.2.41-2+deb7u2 x86_64 GNU/Linux

    Read the article

  • How reliable is HDD SMART data?

    - by andahlst
    Based on SMART data, you can judge the health of a disk; at least that is the idea. If I, for instance, run sudo smartctl -H /dev/sda on my ArchLinux laptop, it says that the hard drive passed the self tests and that it should be "healthy" based on this. My question is how reliable this information is, or more specifically: If according to the SMART data this disk is healthy, what are the odds of the disk suddenly failing despite this? This assumes the failure is not due to some catastrophic event that could not possibly have been predicted, such as the laptop falling on the floor and causing the drive heads to hit the disk. If the SMART data does not say the disk is in good shape, what are the odds of the disk failing within some amount of time? Is it possible that there will be false positives, and how common are these? Of course, I keep backups no matter what. I am mostly curious.

    Read the article

  • Finding the file that is on a bad block on a HFS+ volume (debugfs for HFS+)

    - by Blair Zajac
    I have a drive in our iMac that has bad blocks, as booting from an Ubuntu 11.10 live CD and using ddrescue -f /dev/sda /dev/null finds them. I'd like to get the drive to remap them by writing to the blocks, say using hdparm --write-sector, but I don't want to do this without knowing what's in those blocks and finding the file that owns them, so I can restore the file from another source. I found fileXray but don't feel like spending $79 to map a block to a file and hfsdebug has been taken offline. Are there suggestions on a tool or technique to use? I looked at all the Ubuntu HFS+ packages to see if they could provide this info but nothing jumped out at me. BTW, I used Disk Utility to erase the empty space, but it didn't get any of the bad blocks to be remapped, according to smartctl -A.

    Read the article

  • Various problems with software raid1 array built with Samsung 840 Pro SSDs

    - by Andy B
    I am bringing to ServerFault a problem that has been tormenting me for 6+ months. I have a CentOS 6 (64bit) server with an md software raid-1 array with 2 x Samsung 840 Pro SSDs (512GB). Problems: Serious write speed problems: root [~]# time dd if=arch.tar.gz of=test4 bs=2M oflag=sync 146+1 records in 146+1 records out 307191761 bytes (307 MB) copied, 23.6788 s, 13.0 MB/s real 0m23.680s user 0m0.000s sys 0m0.932s When doing the above (or any other larger copy) the load spikes to unbelievable values (even over 100) going up from ~ 1. When doing the above I've also noticed very weird iostat results: Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 0.00 1589.50 0.00 54.00 0.00 13148.00 243.48 0.60 11.17 0.46 2.50 sdb 0.00 1627.50 0.00 16.50 0.00 9524.00 577.21 144.25 1439.33 60.61 100.00 md1 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 0.00 1602.00 0.00 12816.00 8.00 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 And it stays this way until it actually writes the file to the device (out from swap/cache/memory). The problem is that the second SSD in the array has svctm and await roughly 100 times larger than the first. For some reason the wear is different between the 2 members of the array: root [~]# smartctl --attributes /dev/sda | grep -i wear 177 Wear_Leveling_Count 0x0013 094% 094 000 Pre-fail Always - 180 root [~]# smartctl --attributes /dev/sdb | grep -i wear 177 Wear_Leveling_Count 0x0013 070% 070 000 Pre-fail Always - 1005 The first SSD has a wear of 6% while the second SSD has a wear of 30%!! 
It's like the second SSD in the array works at least 5 times as hard as the first one as proven by the first iteration of iostat (the averages since reboot): Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util sda 10.44 51.06 790.39 125.41 8803.98 1633.11 11.40 0.33 0.37 0.06 5.64 sdb 9.53 58.35 322.37 118.11 4835.59 1633.11 14.69 0.33 0.76 0.29 12.97 md1 0.00 0.00 1.88 1.33 15.07 10.68 8.00 0.00 0.00 0.00 0.00 md2 0.00 0.00 1109.02 173.12 10881.59 1620.39 9.75 0.00 0.00 0.00 0.00 md0 0.00 0.00 0.41 0.01 3.10 0.02 7.42 0.00 0.00 0.00 0.00 What I've tried: I've updated the firmware to DXM05B0Q (following reports of dramatic improvements for 840Ps after this update). I have looked for "hard resetting link" in dmesg to check for cable/backplane issues but nothing. I have checked the alignment and I believe they are aligned correctly (1MB boundary, listing below) I have checked /proc/mdstat and the array is Optimal (second listing below). root [~]# fdisk -ul /dev/sda Disk /dev/sda: 512.1 GB, 512110190592 bytes 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x00026d59 Device Boot Start End Blocks Id System /dev/sda1 2048 4196351 2097152 fd Linux raid autodetect Partition 1 does not end on cylinder boundary. /dev/sda2 * 4196352 4605951 204800 fd Linux raid autodetect Partition 2 does not end on cylinder boundary. 
/dev/sda3 4605952 814106623 404750336 fd Linux raid autodetect root [~]# fdisk -ul /dev/sdb Disk /dev/sdb: 512.1 GB, 512110190592 bytes 255 heads, 63 sectors/track, 62260 cylinders, total 1000215216 sectors Units = sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disk identifier: 0x0003dede Device Boot Start End Blocks Id System /dev/sdb1 2048 4196351 2097152 fd Linux raid autodetect Partition 1 does not end on cylinder boundary. /dev/sdb2 * 4196352 4605951 204800 fd Linux raid autodetect Partition 2 does not end on cylinder boundary. /dev/sdb3 4605952 814106623 404750336 fd Linux raid autodetect /proc/mdstat root # cat /proc/mdstat Personalities : [raid1] md0 : active raid1 sdb2[1] sda2[0] 204736 blocks super 1.0 [2/2] [UU] md2 : active raid1 sdb3[1] sda3[0] 404750144 blocks super 1.0 [2/2] [UU] md1 : active raid1 sdb1[1] sda1[0] 2096064 blocks super 1.1 [2/2] [UU] unused devices: Running a read test with hdparm root [~]# hdparm -t /dev/sda /dev/sda: Timing buffered disk reads: 664 MB in 3.00 seconds = 221.33 MB/sec root [~]# hdparm -t /dev/sdb /dev/sdb: Timing buffered disk reads: 288 MB in 3.01 seconds = 95.77 MB/sec But look what happens if I add --direct root [~]# hdparm --direct -t /dev/sda /dev/sda: Timing O_DIRECT disk reads: 788 MB in 3.01 seconds = 262.08 MB/sec root [~]# hdparm --direct -t /dev/sdb /dev/sdb: Timing O_DIRECT disk reads: 534 MB in 3.02 seconds = 176.90 MB/sec Both tests increase but /dev/sdb doubles while /dev/sda increases maybe 20%. I just don't know what to make of this. As suggested by Mr. 
Wagner, I've done another read test, this time with dd, and it confirms the hdparm test: root [/home2]# dd if=/dev/sda of=/dev/null bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 38.0855 s, 282 MB/s root [/home2]# dd if=/dev/sdb of=/dev/null bs=1G count=10 10+0 records in 10+0 records out 10737418240 bytes (11 GB) copied, 115.24 s, 93.2 MB/s So sda is 3 times faster than sdb. Or maybe sdb is also doing something else besides what sda does. Is there some way to find out if sdb is doing more than what sda does? UPDATE: Again, as suggested by Mr. Wagner, I have swapped the 2 SSDs. And as he expected, the problem moved from sdb to sda. So I guess I'll RMA one of the SSDs. I wonder if the cage might be problematic. What is wrong with this array? Please help!
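
The 6% and 30% wear figures in this question follow from the two Wear_Leveling_Count rows it quotes. Below is a minimal sketch in Python of that arithmetic. It assumes, as the post itself does, that the normalized VALUE column starts at 100 and counts down as the drive wears; the function names `parse_attribute` and `wear_used_percent` are hypothetical, and the two input strings are copied from the post.

```python
def parse_attribute(line: str) -> dict:
    """Split one row of `smartctl --attributes` output into named fields."""
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3].rstrip("%")),  # normalized value, e.g. "094%"
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),
    }


def wear_used_percent(attr: dict) -> int:
    # Assumption: the normalized value starts at 100 and counts down with wear.
    return 100 - attr["value"]


# The two rows quoted in the post.
SDA_LINE = "177 Wear_Leveling_Count 0x0013 094% 094 000 Pre-fail Always - 180"
SDB_LINE = "177 Wear_Leveling_Count 0x0013 070% 070 000 Pre-fail Always - 1005"

sda = parse_attribute(SDA_LINE)
sdb = parse_attribute(SDB_LINE)
used_a = wear_used_percent(sda)
used_b = wear_used_percent(sdb)
print(f"sda: {used_a}% used, sdb: {used_b}% used, ratio {used_b / used_a:.1f}x")
print(f"raw count ratio: {sdb['raw'] / sda['raw']:.1f}x")
```

Both the normalized values and the raw counts point the same way: the second drive shows roughly five times the wear of the first, which matches the "at least 5 times as hard" estimate in the post.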

    Read the article

  • What does the the reconstruction process of mdadm do exactly on raid10

    - by Azrael
    I've got a system with 4 disks set up as raid10. All disks are usable, and mdadm shows them all as UUUU. Due to a recent system crash, the array was marked as "not clean" and a reconstruction is currently in progress. On a closer look smartctl shows problems on one disk: sd 0:0:0:0: [sda] Unhandled sense code sd 0:0:0:0: [sda] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE sd 0:0:0:0: [sda] Sense Key : Medium Error [current] [descriptor] Descriptor sense data with sense descriptors (in hex): 72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 24 cd 78 d4 sd 0:0:0:0: [sda] Add. Sense: Unrecovered read error - auto reallocate failed sd 0:0:0:0: [sda] CDB: Read(10): 28 00 24 cd 75 1e 00 04 00 00 While researching the reconstruction process, I only found information concerning raid5 but nothing for raid10. Can I replace this problematic disk during the reconstruction process, or will I kill the raid by doing so?
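
The hex bytes in the kernel messages above can be decoded to see which block the drive failed to read. Here is a minimal sketch in Python. It assumes the standard SCSI layouts as I understand them: a READ(10) command block with a big-endian LBA in bytes 2-5 and a block count in bytes 7-8, and descriptor-format sense data (response code 0x72) whose first descriptor is an information descriptor carrying the failing LBA. The function names are hypothetical; the byte strings are copied from the post.

```python
from typing import Tuple


def decode_read10(cdb: bytes) -> Tuple[int, int]:
    """Return (start LBA, block count) from a SCSI READ(10) command block."""
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


def decode_failing_lba(sense: bytes) -> int:
    """Return the LBA from the first information descriptor of the sense data."""
    if sense[0] & 0x7F not in (0x72, 0x73):
        raise ValueError("not descriptor-format sense data")
    descriptor = sense[8:]                   # descriptors follow the 8-byte header
    if descriptor[0] != 0x00:
        raise ValueError("first descriptor is not an information descriptor")
    return int.from_bytes(descriptor[4:12], "big")


# Bytes as printed in the kernel log quoted in the post.
CDB = bytes.fromhex("28 00 24 cd 75 1e 00 04 00 00")
SENSE = bytes.fromhex("72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00 24 cd 78 d4")

start, count = decode_read10(CDB)
bad = decode_failing_lba(SENSE)
print(f"read of {count} blocks starting at LBA {start}")
print(f"failing LBA {bad}, offset {bad - start} into the request")
```

Under these assumptions the failing block lies inside the range the read requested, so the log describes a single unreadable sector on sda. The sketch says nothing about how md reacts to that error during a raid10 resync.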

    Read the article

  • Post raid5 setup reboot shows single hard drive failure on ubuntu 12.10?

    - by junkie
    I just set up raid 5 on linux using three HDDs as per a guide. It all went fine until I rebooted and got the following text: http://i.stack.imgur.com/Zsfjk.jpg. Does this mean one of my HDDs has failed? How do I check if any of them are failing? I tried using smartctl and didn't see any issues. Or does it have nothing to do with failure and is something else altogether? I would like to get the raid 5 working again but I'm not sure where to go from here. I'm using ubuntu 12.10, and each of the three raid disks has a gpt partition table with a single full-size partition of filesystem type ext4. Note that I only got an error on reboot, not while I was creating the raid array, which went fine. Thanks.

    Read the article

  • Bad DMA/do_IRQ errors on suspend/resume, with occasional freezing

    - by Steve Kroon
    Every time I suspend or resume my laptop (Dell Latitude E6520, bought this year), I get 2 messages of the form displayed on the console just before shutting down/starting up: [ 407.107610] ehci_hcd 0000:00:1d.0: dma_pool_free buffer-128, f6f18000/36f18000 (bad dma) On occasion, I get a message of the form: [ 3753.979066] do_IRQ: 0.177 No irq handler for vector (irq -1) On occasion, my machine freezes with a flashing Caps Lock button when suspending, after which I need to do a hard shutdown. This never happened before the messages started appearing (a while back), and I think it never happens without a do_IRQ message appearing (although I'm not sure about that). [There's nothing in the owner's manual on a flashing Caps Lock button; apparently it may be a kernel panic if the scroll lock also flashes, but the laptop doesn't have a scroll lock light, and there's no message on the console saying kernel panic.] Are these bad DMA/do IRQ messages serious, and what can I do to investigate/troubleshoot them and the freezing? Edit: I've also now received the following error messages a few times: [246943.023908] JBD: I/O error detected when updating journal superblock for sdb1. [246943.023958] Buffer I/O error on device sdb1, logical block 0 [246943.023996] EXT3-fs (sdb1): I/O error while writing superblock Edit: Output of dmesg at http://pastebin.com/ra7MTQEj ; contents of /var/log/kern.log at http://pastebin.com/i6jf0Md9 Edit: the output of some smartctl (-a, -x, --log=error, --log=xerror) instructions is available at http://paste.ubuntu.com/1088488/ . Edit (31/8/2012): Output of dmesg|grep -i ehci available at http://paste.ubuntu.com/1177246/ .

    Read the article

< Previous Page | 1 2 3  | Next Page >