Panic! Terror! Aaaghh!

El dimecres passat vaig rebre un email del daemon que controla el raid 1 de dos discs de 120 Gb que tinc montat com a /home del servidor de casa, on està allotjat aquest blog, informant-me de que un dels discs havia fallat:

From: 	mdadm monitoring
Subject: 	Fail event on /dev/md0:s0
Date: 	Wed, 27 Jul 2005 23:12:17 +0200

This is an automatically generated mail message from mdadm
running on s0

A Fail event had been detected on md device /dev/md0.


Faithfully yours, etc.

Fins avui no havia tingut temps de mirar-m'ho amb calma... i sorpresa la mia, estic funcionant només amb un disc!! Això vol dir que si peta el disc, em quedo sense totes les dades que tinc al /home... ara sortiré corrents cap a Rda. St. Antoni a comprar un disc, aixi que el servidor estarà parat aquesta nit mentre el canvii... aprofitaré el downtime per posar-li també una unitat de CD (ara no en té) i una altra X100P per connectar-hi un trunk.

Aquí deixo constància dels passos per comprovar el raid.... com es pot veure hi ha un disk que falla (/dev/hdg):

s0 ~ # lsraid -D -l -a /dev/md0
[dev 33, 0] /dev/hde:
        md version              = 0.90.0
        superblock uuid         = 2484EF77.198495F3.E0BD280C.A13A89EF
        md minor number         = 0
        created                 = 1068606386 (Wed Nov 12 04:06:26 2003)
        last updated            = 1123003057 (Tue Aug  2 19:17:37 2005)
        raid level              = 1
        chunk size              = 4 KB
        apparent disk size      = 120060800 KB
        disks in array          = 1
        required disks          = 2
        active disks            = 1
        working disks           = 1
        failed disks            = 1
        spare disks             = 0
        position in disk list   = 0
        position in md device   = 0
        state                   = good

s0 ~ # lsraid -D -a /dev/md0 -d /dev/hdg
[dev 33, 0] /dev/hde:
        md device       = [dev 9, 0] /dev/md0
        md uuid         = 2484EF77.198495F3.E0BD280C.A13A89EF
        state           = good

[dev 34, 0] /dev/hdg:
        old md device   = [dev 9, 0]
        old md uuid     = 2484EF77.198495F3.E0BD280C.A13A89EF
        state           = unknown

Update: He reiniciat el server sense cambiar cap disc, i he vist que el raid arrancava només amb un disc (/dev/hde), però l'altre disc l'ha detectat correctament la BIOS y el sistema operatiu i el mdadm m'ha informat de que el raid està degradat, es a dir, la informació dels dos discs no està sincronitzada:

s0 ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Wed Nov 12 04:06:26 2003
     Raid Level : raid1
     Array Size : 120060800 (114.50 GiB 122.94 GB)
    Device Size : 120060800 (114.50 GiB 122.94 GB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug  2 21:17:24 2005
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 2484ef77:198495f3:e0bd280c:a13a89ef
         Events : 0.16568598

    Number   Major   Minor   RaidDevice State
       0      33        0        0      active sync   /dev/hde
       1       0        0        -      removed

Així que com aparentment el disc no està cascat, he forçat la reconstrucció del raid:

s0 ~ # raidhotadd /dev/md0 /dev/hdg

s0 ~ # dmesg
md: bind<hdg>
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:0, o:1, dev:hde
 disk 1, wo:1, o:1, dev:hdg
.<6>md: syncing RAID array md0
md: minimum _guaranteed_ reconstruction speed: 1000 KB/sec/disc.
md: using maximum available idle IO bandwith (but not more than 200000 KB/sec)
 for reconstruction.
md: using 128k window, over a total of 120060800 blocks.

s0 ~ # mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90.01
  Creation Time : Wed Nov 12 04:06:26 2003
     Raid Level : raid1
     Array Size : 120060800 (114.50 GiB 122.94 GB)
    Device Size : 120060800 (114.50 GiB 122.94 GB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Tue Aug  2 21:41:43 2005
          State : clean, degraded, recovering
 Active Devices : 1
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 1

 Rebuild Status : 12% complete

           UUID : 2484ef77:198495f3:e0bd280c:a13a89ef
         Events : 0.16569389

    Number   Major   Minor   RaidDevice State
       0      33        0        0      active sync   /dev/hde
       1       0        0        -      removed

       2      34        0        1      spare rebuilding   /dev/hdg

s0 ~ # lsraid -D -a /dev/md0 -d /dev/hdg
[dev 34, 0] /dev/hdg:
        md device       = [dev 9, 0] /dev/md0
        md uuid         = 2484EF77.198495F3.E0BD280C.A13A89EF
        state           = spare

[dev 33, 0] /dev/hde:
        md device       = [dev 9, 0] /dev/md0
        md uuid         = 2484EF77.198495F3.E0BD280C.A13A89EF
        state           = good

De moment porta un 12% i sembla que va tot bé... de tota manera he comprat un Maxtor de 200 Gb (7200 RPM 8Mb) per 82 euros (6 euros menys que el que em va costar fa gairebé un any el mateix disc), que igualment posaré al server només per fer backups :P

Update 2: El raid s'ha acabat de reconstruir correctament, suposo que només devía ser un fallo de sync i que el disc no està afectat.

s0 ~ # dmesg
md: md0: sync done.
RAID1 conf printout:
 ---
wd:2 rd:2 disk 0, wo:0, o:1, dev:hde disk 1, wo:0, o:1, dev:hdg s0 ~ # mdadm --detail /dev/md0 /dev/md0: Version : 00.90.01 Creation Time : Wed Nov 12 04:06:26 2003 Raid Level : raid1 Array Size : 120060800 (114.50 GiB 122.94 GB) Device Size : 120060800 (114.50 GiB 122.94 GB) Raid Devices : 2 Total Devices : 2 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Wed Aug 3 02:27:38 2005 State : clean Active Devices : 2 Working Devices : 2 Failed Devices : 0 Spare Devices : 0 UUID : 2484ef77:198495f3:e0bd280c:a13a89ef Events : 0.16578408 Number Major Minor RaidDevice State 0 33 0 0 active sync /dev/hde 1 34 0 1 active sync /dev/hdg s0 ~ # lsraid -D -l -a /dev/md0 [dev 33, 0] /dev/hde: md version = 0.90.0 superblock uuid = 2484EF77.198495F3.E0BD280C.A13A89EF md minor number = 0 created = 1068606386 (Wed Nov 12 04:06:26 2003) last updated = 1123028858 (Wed Aug 3 02:27:38 2005) raid level = 1 chunk size = 4 KB apparent disk size = 120060800 KB disks in array = 2 required disks = 2 active disks = 2 working disks = 2 failed disks = 0 spare disks = 0 position in disk list = 0 position in md device = 0 state = good [dev 34, 0] /dev/hdg: md version = 0.90.0 superblock uuid = 2484EF77.198495F3.E0BD280C.A13A89EF md minor number = 0 created = 1068606386 (Wed Nov 12 04:06:26 2003) last updated = 1123028928 (Wed Aug 3 02:28:48 2005) raid level = 1 chunk size = 4 KB apparent disk size = 120060800 KB disks in array = 2 required disks = 2 active disks = 2 working disks = 2 failed disks = 0 spare disks = 0 position in disk list = 1 position in md device = 1 state = good