The problem:
 > 
 > I have an HSZ40 cluster and I had a serious I/O problem with my
 > database (Informix) in a defined chunk (rza8f partition). I found this
 > message in my /var/adm/syslog.dated:
 > 
 > daemon.log:Sep 25 16:00:10 UXfinanceiro DECsafe: UXfinanceiro Agent
 > ***ALERT: hard device error on /dev/rza8f from
 > UXfinanceiro.supermar.com.br 
 > 
 > The message points to a hardware error in the volume rza8, which is a
 > logical volume of 6 rz29b 4.3 GB disks grouped in a raidset (RAID
 > 5), so I can't tell on which physical device the error occurred. The
 > cluster only wrote the log message below, and didn't identify the
 > device:
 > Instance Code: 01010302 
 > Description: An unrecoverable hardware detected fault occurred.
 > Reporting Component: 1.(01)
 > Description:Executive Services
 > Reporting component's event number: 1.(01)
 > Event Threshold: 2.(02)
 > Classification: HARD. Failure of a component that affects controller
 > performance or precludes access to a device connected to the
 > controller is indicated.
 > Last Failure Code: 018800A0 (No Last Failure Parameters)
 > Last Failure Code: 018800A0
 > Description: A processor interrupt was generated with an indication
 > that the program card was removed.
 > My immediate workaround was to stop using the partition rza8f in the
 > database. But I'm losing 2 GB (the size of rza8f) and I still can't
 > identify the physical device with the problem.
I couldn't find any message that identified the physical device with the
hardware error. I tried all the logs, uerf, the HSZ40 FMU (Fault
Management Utility), and the hszterm commands show disks, show
failedsets, show <everything>, but the HSZ40 didn't flag any device as
faulty (by flashing its error LED).
I spoke with DEC Support (Hardware/Software) and we checked the firmware
of the cluster, disks(rz29b-va), all my definitions and everything
else...with no results.
I had to use the DECevent Software (translator module), with this
command:
   dia -t s:25-sep-1997:08:00:00 e:25-sep-1997:14:00:00
Output below:
RAIDSET State                   x00  NORMAL. All members present and
                                     reconstructed, IF LUN is configured
                                     as a RAIDSET.
Error Count                       1.
Retry Count                       0.
Most Recent ASC                 x80
Most Recent ASCQ                x00
Next Most Recent ASC            x00
Next Most Recent ASCQ           x00
Device Locator              x000003  Port    =   3.
                                     Target  =   0.
                                     LUN     =   0.
Command Opcode                  x28  Read (10 byte)
Original CDB                                                           
--------------------------------------------------------------
With this information (Port 3, Target 0, LUN 0) I ran:
CLI> locate ptl 3 0 0 
This flashed the LED on that disk, so we replaced it and the problem disappeared.
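For anyone who wants to script this step, here is a minimal Python sketch of pulling the Port/Target/LUN values out of the translated DECevent text and building the matching locate command. This is not something we actually ran; the function names are hypothetical, and it assumes the "Device Locator" block is laid out exactly as in the output above.

```python
import re

# Excerpt of the translated DECevent output shown above.
SAMPLE = """\
Device Locator              x000003  Port    =   3.
                                     Target  =   0.
                                     LUN     =   0.
Command Opcode                  x28  Read (10 byte)
"""

def parse_device_locator(text):
    """Return (port, target, lun) from a translated entry, or None if absent.

    Assumes the layout shown above: a 'Device Locator' line followed by
    'Port = N.', 'Target = N.' and 'LUN = N.' fields.
    """
    start = text.find("Device Locator")
    if start == -1:
        return None
    block = text[start:]
    values = []
    for name in ("Port", "Target", "LUN"):
        match = re.search(rf"\b{name}\s*=\s*(\d+)\.", block)
        if match is None:
            return None
        values.append(int(match.group(1)))
    return tuple(values)

def locate_command(port, target, lun):
    """Build the HSZ40 CLI command used above to flash the LED of a disk."""
    return f"locate ptl {port} {target} {lun}"

ptl = parse_device_locator(SAMPLE)
if ptl is not None:
    print(locate_command(*ptl))
```

The printed string still has to be typed at the controller's CLI> prompt by hand; the sketch only saves reading the three numbers out of a long report.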
We really don't know why the cluster didn't flag the device with the
error, but if I get any additional information I will summarize again.
Thanks to all who helped me out!
And sorry for my poor English.
Alberto Camardelli
camardel_at_svn.com.br
Received on Thu Oct 09 1997 - 16:56:33 NZDT