I have been researching this via AltaVista and DejaNews, and it seems
that others are having trouble with drives dropping offline on 2100s,
but in general those reports involve RZ28/29 drives on RAID controllers
and point to firmware problems.
I have a 2100 4/275 that had:
      _(DEC     RZ28     (C) DEC 442C) 
      _(SEAGATE ST15230N         0168) 
      _(SEAGATE ST15230N         0638) 
The 0168 Hawk was always dropping offline, requiring a power cycle to get
it functional again (pulling it from the StorageWorks rack and plugging it
back in).  The 0638 drive seems never to have dropped offline.  I had the
0168 drive's firmware upgraded, but in the interim replaced it with a
      _(FUJITSU M2954S-512 0142) (7200RPM 4GB)
by connecting to the external SCSI connector.
Now this Fujitsu drive is dropping offline too, sometimes requiring a
power cycle of the drive case to come back online (though it usually just
requires a:
        scu -f /dev/rrz3c reset device
to bring it back from the dead.)
We also have another 2100 system with a CONNER 4GB (4107) drive that is
continually going offline (with data corruption when it does).
Since the problem seems so widespread (not tied to one specific disk
type or machine), I'm wondering if it might be a DUNIX or SCSI controller
problem.  (Of course I could just have bad luck with bad disks, but I don't
think so.)
Is anyone aware of any known problems and resolutions that might pertain
to what I'm seeing?  If the SCSI subsystem is being overly optimistic
about what these drives can do, is there something I can change in the
DDR database to make it more lenient?
Here's a more in-depth uerf listing:
------------------------------------------------------------------------------
OPERATING SYSTEM                        DEC OSF/1 
SYSTEM ID                 x00060009     CPU TYPE:  DEC 2100 
                                        Digital UNIX V4.0A  (Rev. 464); Thu 
                                         _Jan  9 09:08:40 MST 1997  
                                        physical memory = 512.00 megabytes. 
                                        Firmware revision: 4.6 
                                        PALcode: OSF version 1.45 
                                        AlphaServer 2100 4/275 
                                        cpu 0 EV-45 4mb b-cache 
                                        cpu 1 EV-45 4mb b-cache 
                                        cpu 2 EV-45 4mb b-cache 
                                        cpu 3 EV-45 4mb b-cache 
                                        psiop0 at pci0 slot 1 
                                        Loading SIOP: script 1000e00, reg 
                                         _81000000, data 405a0de8 
                                        scsi0 at psiop0 slot 0 
                                        rz0 at scsi0 target 0 lun 0 (LID=0) 
                                         _(DEC     RZ28     (C) DEC 442C) 
                                        rz1 at scsi0 target 1 lun 0 (LID=1) 
                                         _(SEAGATE ST15230N         0638) 
                                        rz2 at scsi0 target 2 lun 0 (LID=2) 
                                         _(SEAGATE ST15230N         0638) 
                                        rz3 at scsi0 target 3 lun 0 (LID=3) 
                                         _(FUJITSU M2954S-512       0142) 
                                        rz5 at scsi0 target 5 lun 0 (LID=4) 
                                         _(DEC     RRD43   (C) DEC  1084) 
                                        rz6 at scsi0 target 6 lun 0 (LID=5) 
                                         _(DEC     RRD43   (C) DEC  1084) 
------------------------------------------------------------------------------
When the Fuji went offline, the following uerf message occurred on a
        scu show edt lun 0
------------------------------------------------------------------------------
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  199.     CAM SCSI 
CLASS                         x0022     DEC SIM 
SUBSYSTEM                     x0000     DISK 
BUS #                         x0000
                              x0018     LUN x0
                                        TARGET x3
ROUTINE NAME                            as_finish 
                                        Autosense failed 
CAM ENTRY                 x0000040E     SIM_WS 
ERROR TYPE                              Soft Error Detected (recovered) 
------------------------------------------------------------------------------
This error message occurred prior to that:
------------------------------------------------------------------------------
EVENT CLASS                             ERROR EVENT 
OS EVENT TYPE                  199.     CAM SCSI 
CLASS                         x0000     DISK 
SUBSYSTEM                     x0000     DISK 
BUS #                         x0000
                              x0018     LUN x0
                                        TARGET x3
ROUTINE NAME                            cdisk_complete 
                                        Cmd Timeout - retrying 
ERROR TYPE                              Soft Error Detected (recovered) 
DEVICE NAME                             FUJITSU M2954S-512      .M2954S-512 
                                        Active CCB at time of error 
                                        Command timed out 
ERROR - os_std, os_type = 11, std_type = 10
----- ENT_CCB_SCSIIO -----
*MY ADDR                  x1FE2B580
CCB LENGTH                    x00C0
FUNC CODE            x01
CAM_STATUS                    x000B     CAM_CMD_TIMEOUT 
PATH ID              0.
TARGET ID            3.
TARGET LUN           0.
CAM FLAGS                 x00000482
                                        CAM_QUEUE_ENABLE 
                                        CAM_DIR_OUT 
                                        CAM_SIM_QFRZDIS 
*PDRV_PTR                 x1FE2B228
*NEXT_CCB                 x00000000
*REQ_MAP                  x062CA400
VOID (*CAM_CBFCNP)()      x004811B0
*DATA_PTR                 xA07F4000
DXFER_LEN                 x00010000
*SENSE_PTR                x1FE2B250
SENSE_LEN            x40
CDB_LEN              x06
SGLIST_CNT                    x0000
CAM_SCSI_STATUS               x0000     SCSI_STAT_GOOD 
SENSE_RESID          x00
RESID                     x00010000
CAM_CDB_IO           x000000000000008090C9010A
CAM_TIMEOUT               x0000003C
MSGB_LEN                      x0000
VU_FLAGS                      x4000
TAG_ACTION           x20
------------------------------------------------------------------------------
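
For anyone reading the CCB dump above, a few of the fields can be decoded
by hand.  The following is only a minimal Python sketch, not something
uerf or scu provides.  It assumes that uerf prints CAM_CDB_IO as one hex
number with the first CDB byte (the opcode) in the lowest-order position,
that the disk uses 512-byte blocks, and that CAM_TIMEOUT is in seconds.
The function name decode_cdb6 is made up for illustration.

```python
# Sketch: decode the 6-byte SCSI CDB and timeout from the uerf CCB dump.
# Assumptions (not confirmed by the log itself): CAM_CDB_IO is printed with
# the opcode in the lowest-order byte, blocks are 512 bytes, and
# CAM_TIMEOUT is expressed in seconds.

OPCODES_6 = {0x08: "READ(6)", 0x0A: "WRITE(6)"}  # standard SCSI group 0 opcodes
BLOCK_SIZE = 512  # assumed block size


def decode_cdb6(cam_cdb_io_hex):
    """Return the fields of a 6-byte CDB given the hex string from uerf."""
    digits = cam_cdb_io_hex.lstrip("xX")
    # Reverse the printed bytes so that index 0 is the opcode.
    cdb = bytes.fromhex(digits)[::-1][:6]
    opcode = cdb[0]
    # In a 6-byte CDB the LBA is 21 bits: low 5 bits of byte 1, then bytes 2-3.
    lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]
    # A transfer length of 0 means 256 blocks in the 6-byte commands.
    blocks = cdb[4] or 256
    return {
        "opcode": opcode,
        "opcode_name": OPCODES_6.get(opcode, "unknown"),
        "lba": lba,
        "blocks": blocks,
        "bytes": blocks * BLOCK_SIZE,
        "control": cdb[5],
    }


# Values copied from the dump above.
info = decode_cdb6("x000000000000008090C9010A")
dxfer_len = 0x00010000
resid = 0x00010000
timeout_seconds = 0x3C

print(info)
print("request size matches DXFER_LEN:", info["bytes"] == dxfer_len)
print("bytes actually transferred:", dxfer_len - resid)
print("timeout (assumed seconds):", timeout_seconds)
```

Under those assumptions the command that timed out is a 6-byte WRITE of
128 blocks, which agrees with DXFER_LEN (x10000) and with the CAM_DIR_OUT
flag.  RESID being equal to DXFER_LEN suggests that none of the data had
been transferred when the timeout (x3C, i.e. 60 if the unit is seconds)
expired.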
--stephen
--
Stephen Dowdy - Systems Administrator - CS Dept - Univ of Colorado, Boulder
dowdy_at_cs.colorado.edu - 303-492-6196 - http://www.cs.colorado.edu/~dowdy/
"Team Spam Forever" (A division of Beatrice)    { NO cold Sales Calls !!! }
Received on Thu Jan 30 1997 - 19:26:36 NZDT