Thanks to all who replied, including Dr. Thomas Blinn, Brian Staab,
Alan Rollow, Juan Ramon, Charles Ballowe, Jenny Butler, and anyone
else whose replies are still coming in.
I neglected to mention this was a dual processor ES40 in my posting.
The good Dr. Blinn said:
>The key message in all that gobbledegook is this:
>> vmunix: CPU 1 is prevented from being rebooted.
>> vmunix: The system must be reset or power cycled to clear this state.
>> vmunix: panic (cpu 1): Processor Machine Check
>This is NOT a software problem, you won't learn anything useful from 
>kdbx (that you won't find in the crash-data file in /var/adm/crash), 
>you have broken hardware; this was a hard fault on one CPU (I can't 
>say from the output what system model), you need to get the hardware 
>repaired.
>Too bad you don't have a support contract..  The repair may turn out 
>to be quite pricey on "time and materials".
He's preaching to the choir on the support contract issue! The system
ended up going down 5 times over the weekend, even after a power-off
reboot. I ended up coming in last night, did a full cluster power
down, and rebooted each node as the other (2 booted as 1 and 1 booted
as 2). Everything has been okay since, except that node 2 is now
running on the suspected bad hardware.
Alan Rollow offered:
        That it was a machine check points strongly at a hardware
        problem.  Your contract would probably have gotten you a
        version of WEBES/CA that had analysis rules for the
        system.  If you have that installed, it may be able
        to offer a clue.  Having the WEBES/CA kit is sufficient
        to use it; analysis doesn't require a license.
I'll pursue this further to see if it tells me which CPU failed.
I just happen to have a couple of spares :-) and I could swap
the bad/suspect one out at our next *scheduled* outage.
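
Until the analysis tool has had its say, the panic lines in
/var/adm/messages already carry a CPU number. Here is a minimal Python
sketch that tallies panics per reported CPU. The function name and the
regular expression are my own, written against the one panic line quoted
above, so other panic formats may not match. The number counted is
whatever the kernel printed, which may or may not map onto the physical
module that needs replacing.

```python
import re
from collections import Counter

# Matches lines such as "vmunix: panic (cpu 1): Processor Machine Check".
# Pattern is an assumption based on the single panic line quoted in this thread.
PANIC_RE = re.compile(r"panic \(cpu (\d+)\):")

def panics_by_cpu(lines):
    """Count panic messages per CPU number as reported in the log lines."""
    counts = Counter()
    for line in lines:
        match = PANIC_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

sample = [
    "vmunix: CPU 1 is prevented from being rebooted.",
    "vmunix: panic (cpu 1): Processor Machine Check",
    "vmunix: syncing disks...",
]
print(panics_by_cpu(sample))
```

To run it against the real log, pass an open file instead of the sample
list, e.g. panics_by_cpu(open("/var/adm/messages")).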
Thanks to everybody who replied. No slight intended to anybody
else, but WOW, aren't the two individuals quoted above sharp!
Thanks.  Jeff
-----Original Message-----
From: Jeff Beck 
Sent: Monday, October 20, 2003 12:07 PM
To: Managers List Alpha (tru64-unix-managers_at_ornl.gov)
Subject: System panics
Help! I was under a Silver support contract for years (until new upper
management decided not to renew it in September; don't get me started on
THAT subject) and now I've had 2 system panics within 36 hours. Can anybody
shed any light on what may be wrong (e.g. some piece of hardware about to
die)? This is one node of a 2 node cluster, 5.1A, PK3. Here's a portion of
/var/adm/messages:
vmunix: Machine Check Processor Fatal Abort
vmunix: Machine check code = 0x100000098
vmunix:     Ibox Status                             = 0000000000000000
vmunix:     Dcache Status                           = 0000000000000008
vmunix:     Cbox Address                            = 0000000028d151c0
vmunix:     Fill Syndrome 1                         = 0000000000000016
vmunix:     Fill Syndrome 0                         = 000000000000001f
vmunix:     Cbox Status                             = 0000000000000010
vmunix:     EV6 captured status of Bcache mode      = 0000000000000000
vmunix:     EV6 Exception Address                   = 00000300020a1008
vmunix:     EV6 Interrupt Enablement and Current Processor mode = 0000007ee0000008
vmunix:     EV6 Interrupt Summary Register          = 0000000080000000
vmunix:     EV6 TBmiss or Fault status              = 0000000000000290
vmunix:     EV6 PAL Base Address                    = 0000000000018000
vmunix:     EV6 Ibox control                        = fffffe0006304396
vmunix:     EV6 Ibox Process_context                = 0000460000000004
vmunix:     O/S Summary flag                        = 0000000000000004
vmunix:     Cchip Base Address (phys)               = 00000f01a0000000
vmunix:     Cchip Device Raw Interrupt Request      = 0000000000000000
vmunix:         DRIR Register Decode:
vmunix:             PCI Device Interrupt Mask       = 0000000000000000
vmunix:     Cchip Miscellaneous Register            = 0000000000000000
vmunix:         Misc Register Decode:
vmunix:             Cchip Revision: 00
vmunix:             ID of CPU performing read: 00
vmunix:     Pchip 0 Base Address (phys)             = 00000f0180000000
vmunix:     Pchip 0 Error Register                  = 0000000000000000
vmunix:         Pchip Error Register Decode:
vmunix:             PCI Xaction Start Address       = 0000000000000000
vmunix:             PCI Command: Interrupt Acknowledge
vmunix:     Pchip 1 Base Address (phys)             = 00000f0380000000
vmunix:     Pchip 1 Error Register                  = 0000000000000000
vmunix:         Pchip Error Register Decode:
vmunix:             PCI Xaction Start Address       = 0000000000000000
vmunix:             PCI Command: Interrupt Acknowledge
vmunix: CPU 1 is prevented from being rebooted.
vmunix: The system must be reset or power cycled to clear this state.
vmunix: panic (cpu 1): Processor Machine Check
vmunix: syncing disks...
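
For anyone who wants to compare the register values between two crashes,
here is a rough Python sketch that turns "name = hex value" lines like the
ones above into a dictionary. The function name and patterns are
illustrative assumptions based only on this excerpt. It tolerates a value
that a mail client has wrapped onto the following line, and it makes no
attempt to interpret what any of the bits mean.

```python
import re

# "vmunix: <name> = <hex>" with an optional 0x prefix; the value may be
# missing when it was wrapped onto the next line. Pattern is an assumption
# based on the log excerpt in this message.
REG_RE = re.compile(r"^vmunix:\s+(.+?)\s*=\s*(?:0x)?([0-9a-fA-F]*)$")
HEX_RE = re.compile(r"[0-9a-fA-F]+")

def parse_registers(lines):
    """Return {register name: integer value} for 'name = hex' log lines."""
    regs = {}
    pending = None  # name whose value is expected on the next line
    for raw in lines:
        line = raw.strip()
        if pending is not None and HEX_RE.fullmatch(line):
            regs[pending] = int(line, 16)
            pending = None
            continue
        pending = None
        match = REG_RE.match(line)
        if not match:
            continue  # decode lines, panic lines, etc. are skipped
        name, value = match.group(1), match.group(2)
        if value:
            regs[name] = int(value, 16)
        else:
            pending = name
    return regs

sample = [
    "vmunix: Machine check code = 0x100000098",
    "vmunix:     Dcache Status                           = 0000000000000008",
    "vmunix:     EV6 Interrupt Enablement and Current Processor mode =",
    "0000007ee0000008",
    "vmunix:             Cchip Revision: 00",
]
for name, value in parse_registers(sample).items():
    print(f"{name}: {value:#x}")
```

Running the same function over the register dump from each crash and
comparing the two dictionaries shows which values changed between panics.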
Alternatively, if anybody knows how to use kdbx and could tell me what to
look for with it, that would help also. I've got crash dump files. Thanks.
Jeff
Jeff Beck
jbeck_at_carewiseinc.com
206.749.1878
SHPS Healthcare Services Seattle Operations
1501 4th Ave.
Suite 700
Seattle, WA   98101-1629
Received on Mon Oct 20 2003 - 23:03:00 NZDT