Support #13235
Server hardware failure (sdc hard drive)
100%
Description
Hi Philippe,
Gisaf is not connecting.
Could you please check it?
I have attached a PNG for your reference.
Thank you.
History
#1 Updated by Philippe May about 3 years ago
- Project changed from Gisaf to GIS
- Status changed from New to In Progress
Definitely not a problem with Gisaf, but with the server.
I could connect with VPN on gisdb.csr.av, but not with ssh to the dom0 (dream.csr.av). Coming to the CSR office: the server console showed a lot of messages related to disk I/O errors on sdc2.
Rebooted in safe mode: could not fsck /dev/sdc2 since it's the root FS. Continued the boot and the services and VMs started fine.
BUT: it's a sign that the sdc drive is having issues, and action is needed very soon.
#2 Updated by Philippe May about 3 years ago
Some information:
smartctl -a /dev/sdc
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST2000DM006-2DM164
Serial Number:    Z8E0SXSW
LU WWN Device Id: 5 000c50 0a5b727e4
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Nov 18 16:32:05 2021 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (  80) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 210) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   058   057   006    Pre-fail  Always       -       95563133
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       465
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23428726
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       33124
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       469
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   049   045    Old_age   Always       -       44 (Min/Max 44/46)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       406
193 Load_Cycle_Count        0x0032   093   093   000    Old_age   Always       -       14277
194 Temperature_Celsius     0x0022   044   051   000    Old_age   Always       -       44 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       32007h+16m+24.524s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4757015136
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       6203030490

SMART Error Log Version: 1
ATA Error Count: 2030 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2030 occurred at disk power-on lifetime: 33124 hours (1380 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 58 ff ff ff 4f 00      00:54:01.824  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:53:59.625  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED

Error 2029 occurred at disk power-on lifetime: 33124 hours (1380 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:53:55.970  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      00:53:55.962  FLUSH CACHE EXT

Error 2028 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 50 ff ff ff 4f 00      00:14:00.991  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:14:00.991  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      00:14:00.987  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:14:00.987  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:14:00.986  IDENTIFY DEVICE

Error 2027 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 50 ff ff ff 4f 00      00:13:59.636  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.368  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED

Error 2026 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 ff ff ff 4f 00      00:13:53.658  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.657  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.657  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.656  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00      00:13:53.656  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7346         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
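The overall-health result above says PASSED even though several raw counters are non-zero, so it can help to filter the attribute table directly. The following is a minimal sketch, not part of the original ticket: the function name `smart_watch` is hypothetical, and the watch list (attributes 5, 187, 197 and 198) is an assumption about which counters are worth checking first.

```shell
# Print watched SMART attributes whose first raw-value token is non-zero.
# Reads the output of `smartctl -A` (or `smartctl -a`) on stdin.
# The watch list (5, 187, 197, 198) is an assumption, not a smartctl feature.
smart_watch() {
    awk '
        # Attribute rows: numeric ID, name, 0x... flag, and at least 10 fields;
        # field 10 is the first token of RAW_VALUE.
        $1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 &&
        ($1 == 5 || $1 == 187 || $1 == 197 || $1 == 198) {
            if ($10 + 0 > 0) printf "%s %s raw=%s\n", $1, $2, $10
        }
    '
}

# Usage on a live system (needs root):
#   smartctl -A /dev/sdc | smart_watch
```

With the values reported in this ticket, such a filter would list Reported_Uncorrect (187), Current_Pending_Sector (197) and Offline_Uncorrectable (198), while Reallocated_Sector_Ct (5) stays silent because its raw value is 0.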
lsblk
NAME                                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                    8:0    0   1.8T  0 disk
├─sda1                                 8:1    0   512M  0 part  /boot/efi
├─sda2                                 8:2    0  23.3G  0 part
├─sda3                                 8:3    0   9.3G  0 part
├─sda4                                 8:4    0   7.9G  0 part
├─sda5                                 8:5    0   1.9G  0 part
└─sda6                                 8:6    0   1.8T  0 part
  └─md0                                9:0    0   1.8T  0 raid1
    ├─dream.csr-gisaf.csr.av--swap   253:0    0     8G  0 lvm
    ├─dream.csr-gisaf.csr.av--disk   253:1    0  1000G  0 lvm
    ├─dream.csr-infra.csr.av--swap   253:2    0     1G  0 lvm
    ├─dream.csr-infra.csr.av--disk   253:3    0    10G  0 lvm
    ├─dream.csr-samba.csr.av--swap   253:4    0     1G  0 lvm
    ├─dream.csr-samba.csr.av--disk   253:5    0   100G  0 lvm
    ├─dream.csr-gisaf2.csr.av--swap  253:6    0     1G  0 lvm
    ├─dream.csr-gisaf2.csr.av--disk  253:7    0    10G  0 lvm
    ├─dream.csr-gisdb.csr.av--swap   253:8    0     1G  0 lvm
    ├─dream.csr-gisdb.csr.av--disk   253:9    0    10G  0 lvm
    ├─dream.csr-jupyter.csr.av--swap 253:10   0     1G  0 lvm
    └─dream.csr-jupyter.csr.av--disk 253:11   0    20G  0 lvm
sdb                                    8:16   0 931.5G  0 disk
├─sdb1                                 8:17   0   512M  0 part
├─sdb2                                 8:18   0  23.3G  0 part
├─sdb3                                 8:19   0   9.3G  0 part
├─sdb4                                 8:20   0   7.9G  0 part
└─sdb5                                 8:21   0   1.9G  0 part
sdc                                    8:32   0   1.8T  0 disk
├─sdc1                                 8:33   0   1.3T  0 part  /var/backups
├─sdc2                                 8:34   0  23.4G  0 part  /
└─sdc3                                 8:35   0   7.5G  0 part  [SWAP]
sdd                                    8:48   0   1.8T  0 disk
├─sdd1                                 8:49   0   512M  0 part
├─sdd2                                 8:50   0  23.3G  0 part
├─sdd3                                 8:51   0   9.3G  0 part
├─sdd4                                 8:52   0   7.9G  0 part
├─sdd5                                 8:53   0   1.9G  0 part
└─sdd6                                 8:54   0   1.8T  0 part
  └─md0                                9:0    0   1.8T  0 raid1
    ├─dream.csr-gisaf.csr.av--swap   253:0    0     8G  0 lvm
    ├─dream.csr-gisaf.csr.av--disk   253:1    0  1000G  0 lvm
    ├─dream.csr-infra.csr.av--swap   253:2    0     1G  0 lvm
    ├─dream.csr-infra.csr.av--disk   253:3    0    10G  0 lvm
    ├─dream.csr-samba.csr.av--swap   253:4    0     1G  0 lvm
    ├─dream.csr-samba.csr.av--disk   253:5    0   100G  0 lvm
    ├─dream.csr-gisaf2.csr.av--swap  253:6    0     1G  0 lvm
    ├─dream.csr-gisaf2.csr.av--disk  253:7    0    10G  0 lvm
    ├─dream.csr-gisdb.csr.av--swap   253:8    0     1G  0 lvm
    ├─dream.csr-gisdb.csr.av--disk   253:9    0    10G  0 lvm
    ├─dream.csr-jupyter.csr.av--swap 253:10   0     1G  0 lvm
    └─dream.csr-jupyter.csr.av--disk 253:11   0    20G  0 lvm
lshw -c disk
  *-disk:0
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: CC26
       serial: Z8E0QMHC
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=674b4b97-ac3c-4564-b4fe-e7f7be9b4fa9 logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: TOSHIBA DT01ACA1
       vendor: Toshiba
       physical id: 1
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: A810
       serial: 77I739JMS
       size: 931GiB (1TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=59098f62-b43f-4f54-b510-3548c7c21fec logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 2
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: CC26
       serial: Z8E0SXSW
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=2883c10e-8661-48e9-af21-81073d972719 logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 3
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: CC26
       serial: Z4Z98DCQ
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=59098f62-b43f-4f54-b510-3548c7c21fec logicalsectorsize=512 sectorsize=4096
#3 Updated by Philippe May about 3 years ago
- Subject changed from Gisaf is not connecting to Server hardware failure (sdc hard drive)
#4 Updated by Philippe May about 3 years ago
sdb is not in use (it was used only during the initial server install) and does not show any errors.
So I re-partitioned it with an ext4 partition on sdb2, then cloned the root partition onto it:
root@dream:~# dd if=/dev/sdc2 of=/dev/sdb2 bs=64K conv=noerror,sync
dd: error reading '/dev/sdc2': Input/output error
102468+1 records in
102469+0 records out
6715408384 bytes (6.7 GB, 6.3 GiB) copied, 77.6204 s, 86.5 MB/s
382991+1 records in
382992+0 records out
25099763712 bytes (25 GB, 23 GiB) copied, 261.595 s, 95.9 MB/s
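The record counts printed by dd locate the unreadable area: 102468 full 64 KiB records were read before the error, so the failed block starts at record 102468 of sdc2, and with conv=noerror,sync the short read is padded with zeros up to a full block in the copy. A small sketch of that arithmetic follows; the helper name `dd_failed_offset` is hypothetical, and the offsets are relative to the start of the partition, not the whole disk.

```shell
# Byte offset, within the source partition, of the block dd could not read.
# $1: number of full records read before the error
# $2: dd block size in bytes
dd_failed_offset() {
    echo $(( $1 * $2 ))
}

# Values taken from the dd output above: 102468 full records, bs=64K.
bs=65536
start=$(dd_failed_offset 102468 "$bs")
echo "unreadable block starts at byte $start of the partition"
echo "512-byte sectors affected: $(( start / 512 )) to $(( (start + bs) / 512 - 1 ))"
```

The consequence is that up to 64 KiB of the cloned root filesystem may hold zeros instead of the original data, which is why a full filesystem check of the copy is a reasonable next step.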
I also copied the EFI partition:
root@dream:~# dd if=/dev/sda1 of=/dev/sdb1 bs=64K conv=noerror,sync
8192+0 records in
8192+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 5.4858 s, 97.9 MB/s
Since dd reported 1 error, I ran a complete fsck on the copy:
root@dream:~# fsck -pvcf /dev/sdb2
fsck from util-linux 2.36.1
/dev/sdb2: Updating bad block inode.

       46166 inodes used (3.01%, out of 1534896)
         185 non-contiguous files (0.4%)
          76 non-contiguous directories (0.2%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 41521/139/4
     1009063 blocks used (16.47%, out of 6127616)
           0 bad blocks
           1 large file

       36869 regular files
        4699 directories
           7 character device files
           0 block device files
           0 fifos
          13 links
        4580 symbolic links (4485 fast symbolic links)
           2 sockets
------------
       46170 files
Looks like all is OK.
Next: configure the server to boot with root on /dev/sdb2.
#5 Updated by Philippe May about 3 years ago
- % Done changed from 0 to 100
- Status changed from In Progress to Resolved
Today the dom0 (dream.csr.av) was not accessible.
I came to CSR again this afternoon and rebooted the server, starting it manually with root set to sdb2 through the grub options. So far so good: no errors.
I then tried to make sure that the server's EFI config boots on sdb2:
- changed /etc/fstab to make sure that /dev/sda1 is used as /boot/efi (needed because the UUIDs are ambiguous since I cloned the partition with dd yesterday)
- mounted /dev/sda1 manually as /boot/efi
- ran grub-install: hopefully the server will eventually reboot on sdb2.
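The UUID confusion mentioned in the steps above comes from dd copying the filesystem together with its UUID, so the source and the clone become indistinguishable to any fstab or boot entry that refers to a UUID. A minimal sketch for spotting such duplicates is shown below; the function name `dup_uuids` is hypothetical and it only reports, it changes nothing.

```shell
# Print every filesystem UUID that appears on more than one device,
# followed by the devices carrying it.
# Reads "NAME UUID" pairs on stdin; devices without a UUID are ignored.
dup_uuids() {
    awk 'NF == 2 { devs[$2] = devs[$2] " " $1; count[$2]++ }
         END { for (u in count) if (count[u] > 1) print u ":" devs[u] }'
}

# Usage on a live system:
#   lsblk -rno NAME,UUID | dup_uuids
```

Any pair reported this way (here presumably sda1/sdb1, and also sdc2/sdb2 since that partition was cloned the same way) is better referenced by device path or by a regenerated UUID than by the shared one.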
~~~
So, in short, this new setup should avoid the need to replace any hardware for now.
4 hard drives:
- sdc (2TB): has shown a worrying error: 1 bad sector. The rest (backup space) seems alright.
- sdb (1TB): now used for the dom0, seems all OK.
- sda and sdd (2TB each): RAID for all applications, data and VMs: all OK.
~~~
Marking this ticket as resolved. I will follow up if there is an issue when the server restarts.