Support #13235

Server hardware failure (sdc hard drive)

Added by Selvarani C over 2 years ago. Updated over 2 years ago.

Status: Resolved
Priority: Normal
Assignee:
Start date: 18/11/2021
Due date:
% Done: 100%

Description

Hi Philippe,

Gisaf is not connecting.
Could you please check it?

I have attached a PNG screenshot for your reference.

Thank you.

Not Connected.png (68.4 KB) Selvarani C, 18/11/2021 15:16

History

#1 Updated by Philippe May over 2 years ago

  • Project changed from Gisaf to GIS
  • Status changed from New to In Progress

Definitely not a problem with Gisaf, but with the server.

I could connect through the VPN to gisdb.csr.av, but not with ssh to the dom0 (dream.csr.av). At the CSR office, the server console showed a lot of messages related to disk I/O errors on sdc2.

Rebooted in safe mode: could not fsck /dev/sdc2 since it's the root FS. Continued the boot and the services and VMs started fine.

BUT: it's a sign that the sdc drive is having issues, and action is needed very soon.

#2 Updated by Philippe May over 2 years ago

Some information:

smartctl -a /dev/sdc

smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.10.0-9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate BarraCuda 3.5
Device Model:     ST2000DM006-2DM164
Serial Number:    Z8E0SXSW
LU WWN Device Id: 5 000c50 0a5b727e4
Firmware Version: CC26
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Nov 18 16:32:05 2021 IST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (   80) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   1) minutes.
Extended self-test routine
recommended polling time:      ( 210) minutes.
Conveyance self-test routine
recommended polling time:      (   2) minutes.
SCT capabilities:            (0x1085)    SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   058   057   006    Pre-fail  Always       -       95563133
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       465
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   073   060   030    Pre-fail  Always       -       23428726
  9 Power_On_Hours          0x0032   063   063   000    Old_age   Always       -       33124
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       469
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   056   049   045    Old_age   Always       -       44 (Min/Max 44/46)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       406
193 Load_Cycle_Count        0x0032   093   093   000    Old_age   Always       -       14277
194 Temperature_Celsius     0x0022   044   051   000    Old_age   Always       -       44 (0 24 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       32007h+16m+24.524s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       4757015136
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       6203030490

SMART Error Log Version: 1
ATA Error Count: 2030 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2030 occurred at disk power-on lifetime: 33124 hours (1380 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 58 ff ff ff 4f 00      00:54:01.824  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:53:59.625  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:59.625  WRITE FPDMA QUEUED

Error 2029 occurred at disk power-on lifetime: 33124 hours (1380 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00      00:53:57.213  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:53:55.970  READ FPDMA QUEUED
  ea 00 00 00 00 00 a0 00      00:53:55.962  FLUSH CACHE EXT

Error 2028 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 50 ff ff ff 4f 00      00:14:00.991  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:14:00.991  READ FPDMA QUEUED
  ef 10 02 00 00 00 a0 00      00:14:00.987  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 e0 00      00:14:00.987  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 a0 00      00:14:00.986  IDENTIFY DEVICE

Error 2027 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: WP at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 00 50 ff ff ff 4f 00      00:13:59.636  WRITE FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.368  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:57.367  READ FPDMA QUEUED

Error 2026 occurred at disk power-on lifetime: 33123 hours (1380 days + 3 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 53 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 28 ff ff ff 4f 00      00:13:53.658  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.657  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.657  READ FPDMA QUEUED
  60 00 08 ff ff ff 4f 00      00:13:53.656  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00      00:13:53.656  READ FPDMA QUEUED

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      7346         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
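
Note that the overall self-assessment above says PASSED, while the attribute table reports 8 pending sectors, 8 offline-uncorrectable sectors and a Reported_Uncorrect raw value of 65535. A minimal sketch of how such rows could be picked out of `smartctl -a` text automatically; the watch list and the function name are assumptions made for illustration, not something smartmontools defines:

```python
import re

# Attributes whose non-zero raw value usually deserves attention.
# This watch list is an assumption for illustration, not a smartmontools rule.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reported_Uncorrect",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# One row of the attribute table:
# ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)

def worrying_attributes(smartctl_text):
    """Return {name: raw_value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_text.splitlines():
        m = ROW.match(line)
        if m and m.group(2) in WATCHED:
            raw = int(m.group(9))  # leading integer of the RAW_VALUE column
            if raw != 0:
                found[m.group(2)] = raw
    return found

# Rows copied from the smartctl output above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       65535
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(worrying_attributes(SAMPLE))
```

On the rows above this flags Reported_Uncorrect, Current_Pending_Sector and Offline_Uncorrectable, and leaves Reallocated_Sector_Ct alone since its raw value is still 0.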

lsblk

NAME                                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                    8:0    0   1.8T  0 disk  
├─sda1                                 8:1    0   512M  0 part  /boot/efi
├─sda2                                 8:2    0  23.3G  0 part  
├─sda3                                 8:3    0   9.3G  0 part  
├─sda4                                 8:4    0   7.9G  0 part  
├─sda5                                 8:5    0   1.9G  0 part  
└─sda6                                 8:6    0   1.8T  0 part  
  └─md0                                9:0    0   1.8T  0 raid1 
    ├─dream.csr-gisaf.csr.av--swap   253:0    0     8G  0 lvm   
    ├─dream.csr-gisaf.csr.av--disk   253:1    0  1000G  0 lvm   
    ├─dream.csr-infra.csr.av--swap   253:2    0     1G  0 lvm   
    ├─dream.csr-infra.csr.av--disk   253:3    0    10G  0 lvm   
    ├─dream.csr-samba.csr.av--swap   253:4    0     1G  0 lvm   
    ├─dream.csr-samba.csr.av--disk   253:5    0   100G  0 lvm   
    ├─dream.csr-gisaf2.csr.av--swap  253:6    0     1G  0 lvm   
    ├─dream.csr-gisaf2.csr.av--disk  253:7    0    10G  0 lvm   
    ├─dream.csr-gisdb.csr.av--swap   253:8    0     1G  0 lvm   
    ├─dream.csr-gisdb.csr.av--disk   253:9    0    10G  0 lvm   
    ├─dream.csr-jupyter.csr.av--swap 253:10   0     1G  0 lvm   
    └─dream.csr-jupyter.csr.av--disk 253:11   0    20G  0 lvm   
sdb                                    8:16   0 931.5G  0 disk  
├─sdb1                                 8:17   0   512M  0 part  
├─sdb2                                 8:18   0  23.3G  0 part  
├─sdb3                                 8:19   0   9.3G  0 part  
├─sdb4                                 8:20   0   7.9G  0 part  
└─sdb5                                 8:21   0   1.9G  0 part  
sdc                                    8:32   0   1.8T  0 disk  
├─sdc1                                 8:33   0   1.3T  0 part  /var/backups
├─sdc2                                 8:34   0  23.4G  0 part  /
└─sdc3                                 8:35   0   7.5G  0 part  [SWAP]
sdd                                    8:48   0   1.8T  0 disk  
├─sdd1                                 8:49   0   512M  0 part  
├─sdd2                                 8:50   0  23.3G  0 part  
├─sdd3                                 8:51   0   9.3G  0 part  
├─sdd4                                 8:52   0   7.9G  0 part  
├─sdd5                                 8:53   0   1.9G  0 part  
└─sdd6                                 8:54   0   1.8T  0 part  
  └─md0                                9:0    0   1.8T  0 raid1 
    ├─dream.csr-gisaf.csr.av--swap   253:0    0     8G  0 lvm   
    ├─dream.csr-gisaf.csr.av--disk   253:1    0  1000G  0 lvm   
    ├─dream.csr-infra.csr.av--swap   253:2    0     1G  0 lvm   
    ├─dream.csr-infra.csr.av--disk   253:3    0    10G  0 lvm   
    ├─dream.csr-samba.csr.av--swap   253:4    0     1G  0 lvm   
    ├─dream.csr-samba.csr.av--disk   253:5    0   100G  0 lvm   
    ├─dream.csr-gisaf2.csr.av--swap  253:6    0     1G  0 lvm   
    ├─dream.csr-gisaf2.csr.av--disk  253:7    0    10G  0 lvm   
    ├─dream.csr-gisdb.csr.av--swap   253:8    0     1G  0 lvm   
    ├─dream.csr-gisdb.csr.av--disk   253:9    0    10G  0 lvm   
    ├─dream.csr-jupyter.csr.av--swap 253:10   0     1G  0 lvm   
    └─dream.csr-jupyter.csr.av--disk 253:11   0    20G  0 lvm   
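
The lsblk output shows md0 as a RAID1 built on sda6 and sdd6, which carries all the VM volumes. A minimal sketch of a health check for such an array, assuming the usual `/proc/mdstat` layout where a status field like `[UU]` marks every member as up and `_` marks a missing one. The sample text (including the block count) is hypothetical and was not captured from this server:

```python
import os
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose member-status field (e.g. [U_]) shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        head = re.match(r"^(md\d+)\s*:", line)
        if head:
            current = head.group(1)  # remember which array the next lines describe
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded.append(current)
    return degraded

# Hypothetical sample in /proc/mdstat format (values are made up).
HEALTHY = """\
Personalities : [raid1]
md0 : active raid1 sdd6[1] sda6[0]
      1927622656 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    if os.path.exists("/proc/mdstat"):
        with open("/proc/mdstat") as f:
            print(degraded_arrays(f.read()))
```

An empty list means every array reports all members present; it says nothing about the health of the underlying disks, which still needs smartctl.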

lshw -c disk

  *-disk:0
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: CC26
       serial: Z8E0QMHC
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=674b4b97-ac3c-4564-b4fe-e7f7be9b4fa9 logicalsectorsize=512 sectorsize=4096
  *-disk:1
       description: ATA Disk
       product: TOSHIBA DT01ACA1
       vendor: Toshiba
       physical id: 1
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: A810
       serial: 77I739JMS
       size: 931GiB (1TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=59098f62-b43f-4f54-b510-3548c7c21fec logicalsectorsize=512 sectorsize=4096
  *-disk:2
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 2
       bus info: scsi@2:0.0.0
       logical name: /dev/sdc
       version: CC26
       serial: Z8E0SXSW
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=2883c10e-8661-48e9-af21-81073d972719 logicalsectorsize=512 sectorsize=4096
  *-disk:3
       description: ATA Disk
       product: ST2000DM006-2DM1
       physical id: 3
       bus info: scsi@3:0.0.0
       logical name: /dev/sdd
       version: CC26
       serial: Z4Z98DCQ
       size: 1863GiB (2TB)
       capabilities: gpt-1.00 partitioned partitioned:gpt
       configuration: ansiversion=5 guid=59098f62-b43f-4f54-b510-3548c7c21fec logicalsectorsize=512 sectorsize=4096

#3 Updated by Philippe May over 2 years ago

  • Subject changed from Gisaf is not connecting to Server hardware failure (sdc hard drive)

#4 Updated by Philippe May over 2 years ago

sdb is not used (was used initially during server install) and does not show any error.

So, I re-partitioned it with an sdb2 ext4 partition, and:

root@dream:~# dd if=/dev/sdc2 of=/dev/sdb2 bs=64K conv=noerror,sync
dd: error reading '/dev/sdc2': Input/output error
102468+1 records in
102469+0 records out
6715408384 bytes (6.7 GB, 6.3 GiB) copied, 77.6204 s, 86.5 MB/s
382991+1 records in
382992+0 records out
25099763712 bytes (25 GB, 23 GiB) copied, 261.595 s, 95.9 MB/s
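
The dd statistics are enough to locate the unreadable area. A small worked calculation, assuming the first statistics block (102468+1 records in) was printed at the moment of the read error, so that the failing read is the block that follows 102468 complete 64K records. The function name is made up for illustration; offsets are relative to the start of /dev/sdc2, not of the whole disk:

```python
BLOCK_SIZE = 64 * 1024   # bs=64K from the dd command
LOGICAL_SECTOR = 512     # "512 bytes logical" in the smartctl output

def failed_block_location(full_records_in, block_size=BLOCK_SIZE, sector=LOGICAL_SECTOR):
    """Locate the dd block whose read failed, relative to the start of the input partition."""
    start = full_records_in * block_size
    return {
        "byte_offset": start,                      # first byte of the failing block
        "first_sector": start // sector,           # same position in 512-byte sectors
        "sectors_in_block": block_size // sector,  # how many sectors one dd block covers
    }

loc = failed_block_location(102468)

if __name__ == "__main__":
    print(loc)
    # Cross-check against the byte counts dd printed (records out * block size).
    print(102469 * BLOCK_SIZE, 382992 * BLOCK_SIZE)
```

With `conv=noerror,sync`, dd keeps going after the error and pads the short input block with zero bytes, so at most this single 64K block (128 logical sectors) of the copy can differ from the source.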

Also, I copied the EFI partition:

root@dream:~# dd if=/dev/sda1 of=/dev/sdb1 bs=64K conv=noerror,sync
8192+0 records in
8192+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 5.4858 s, 97.9 MB/s

Since dd reported 1 error, I ran a complete fsck:

root@dream:~# fsck -pvcf /dev/sdb2
fsck from util-linux 2.36.1
/dev/sdb2: Updating bad block inode.

       46166 inodes used (3.01%, out of 1534896)
         185 non-contiguous files (0.4%)
          76 non-contiguous directories (0.2%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 41521/139/4
     1009063 blocks used (16.47%, out of 6127616)
           0 bad blocks
           1 large file

       36869 regular files
        4699 directories
           7 character device files
           0 block device files
           0 fifos
          13 links
        4580 symbolic links (4485 fast symbolic links)
           2 sockets
------------
       46170 files

Looks like all is OK.

Next: configure the server to boot with root on /dev/sdb2.

#5 Updated by Philippe May over 2 years ago

  • % Done changed from 0 to 100
  • Status changed from In Progress to Resolved

Today the dom0 (dream.csr.av) was not accessible.

I came to CSR again this afternoon and rebooted the server, booting it manually with root set to sdb2 through GRUB options. So far so good: no errors.

Tried to make sure that the server's EFI config boots on sdb2:

  • changed /etc/fstab to make sure that /dev/sda1 is used as /boot/efi (needed because of the UUID confusion since I cloned the partitions yesterday with dd)
  • mounted /dev/sda1 manually as /boot/efi
  • ran grub-install: hopefully the server will now reboot on sdb2.
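
About the UUID confusion: a partition cloned with dd carries the same filesystem UUID as its source, so any `UUID=` entry in /etc/fstab can match two devices. A minimal sketch that lists such duplicates from `blkid` output, assuming the usual `device: KEY="value" ...` line format. The UUID values in the sample are hypothetical placeholders, not the real ones from this server:

```python
import re
from collections import defaultdict

def duplicate_uuids(blkid_text):
    """Map each filesystem UUID that appears on more than one device to those devices."""
    by_uuid = defaultdict(list)
    for line in blkid_text.splitlines():
        # \sUUID=" skips PARTUUID= and UUID_SUB=, which are different identifiers.
        m = re.match(r'^(\S+):.*?\sUUID="([^"]+)"', line)
        if m:
            by_uuid[m.group(2)].append(m.group(1))
    return {uuid: devs for uuid, devs in by_uuid.items() if len(devs) > 1}

# Hypothetical blkid output after cloning sda1 -> sdb1 and sdc2 -> sdb2.
SAMPLE = """\
/dev/sda1: UUID="AAAA-0001" TYPE="vfat" PARTUUID="00000000-0000-0000-0000-000000000001"
/dev/sdb1: UUID="AAAA-0001" TYPE="vfat" PARTUUID="00000000-0000-0000-0000-000000000002"
/dev/sdb2: UUID="11111111-2222-3333-4444-555555555555" TYPE="ext4" PARTUUID="00000000-0000-0000-0000-000000000003"
/dev/sdc2: UUID="11111111-2222-3333-4444-555555555555" TYPE="ext4" PARTUUID="00000000-0000-0000-0000-000000000004"
/dev/sdc1: UUID="99999999-8888-7777-6666-555555555555" TYPE="ext4" PARTUUID="00000000-0000-0000-0000-000000000005"
"""

if __name__ == "__main__":
    for uuid, devices in duplicate_uuids(SAMPLE).items():
        print(uuid, "->", ", ".join(devices))
```

Any UUID reported here is ambiguous in /etc/fstab; referring to the device path, as done above for /boot/efi, sidesteps the problem.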

~~~

In short, this new setup should avoid the need to replace hardware.

4 hard drives:

  • sdc (2TB): has shown a worrying error: 1 bad sector. The rest (backup space) seems alright.
  • sdb (1TB): now used for the dom0, seems all OK.
  • sda and sdd: RAID for all applications, data and VMs: all OK.

~~~

Marking this ticket as resolved. Will follow up if an issue appears when the server restarts.
