Hi managers
Thank you to all who replied:
  Donn Aiken <daiken_at_regents.edu>
  Whitney Latta <latta_at_decatl.alf.dec.com>
  Heater, Gene <Gene.Heater_at_echostar.com>
  Dr. Tom Blinn <tpb_at_doctor.zk3.dec.com>
Donn Aiken, who had a very similar problem, gave the following advice:
> 1.  Went to the latest public release of the firmware (V5.3 for our 4100).
> 2.  Updated the Alphabios to the same version (ours was out of sync).
> 3.  Updated the kzpsa firmware to the A11 version.
> 4.  Replaced the SCSI cable from our kzpsa controller to our raid array with
>     a better, shorter one (but not too short!).
> 5.  Updated the PALcode to the same version as was on the public release
>     (which was 1.21-26)
Having done most of these things, I can now rule out the trivial causes.
Finally, I called Compaq H/W support.
Below is the most detailed answer, from Dr. Tom Blinn:
> Hardware.  The PALcode (Privileged Architecture Library) is what runs to deal
> with the really low level hardware events, such as interrupts, arithmetic
> traps, and so forth (including memory management page faults).  In each case
> it either processes the interrupt directly (e.g., it might handle a single 
> bit memory error by ignoring it if it's been told to do so, as long as the 
> error was corrected by the ECC code), or it reports the event to the kernel,
> through a transfer of control through a well-defined interface.
> When the PALcode sees a machine check (hardware fault), it's supposed to put
> a log frame (event logging) into memory and transfer control to the kernel, 
> and then the kernel logs the event in the error log and either panics the 
> system (if it's a fatal error) or returns control through the PALcode so that
> things keep running.  Either way, the PALcode is involved for hardware
> faults, such as a "machine check" (which is a class of hardware fault).
> In this case, a machine check occurred WHILE THE PALcode WAS ALREADY RUNNING.
> For instance, if while an interrupt is being serviced, a machine check occurs
> and you're still running in the PALcode, you'd see this error.
>
> Since the frequency is increasing that would suggest that whatever it is in
> the hardware that's failing is starting to fail more often.  Get the hardware
> repaired and this problem should go away.
>
> Tom
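
To make the control flow Tom describes easier to follow, here is a toy
model in Python.  It is only a sketch of the description above, not real
PALcode or kernel code: the names Cpu, Outcome, machine_check and
kernel_handle are made up for illustration, and the "log frame" is just a
dictionary.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    RESUMED = auto()  # kernel logged the event and returned through PALcode
    PANIC = auto()    # kernel logged the event and panicked (fatal error)
    HALTED = auto()   # machine check arrived while already in PAL mode


@dataclass
class Cpu:
    """Hypothetical model of the flow described above (illustration only)."""
    in_pal_mode: bool = False
    error_log: list = field(default_factory=list)

    def machine_check(self, fatal: bool) -> Outcome:
        if self.in_pal_mode:
            # PALcode was already running, so nothing can build a log frame
            # or hand the event to the kernel: the CPU halts.
            return Outcome.HALTED
        self.in_pal_mode = True
        frame = {"event": "machine check", "fatal": fatal}  # log frame
        self.in_pal_mode = False  # control is transferred to the kernel
        return self.kernel_handle(frame)

    def kernel_handle(self, frame: dict) -> Outcome:
        # The kernel records the event, then panics or keeps running.
        self.error_log.append(frame)
        return Outcome.PANIC if frame["fatal"] else Outcome.RESUMED


cpu = Cpu()
normal = cpu.machine_check(fatal=False)   # handled and logged
cpu.in_pal_mode = True                    # e.g. an interrupt is being serviced
nested = cpu.machine_check(fatal=False)   # nothing is logged, CPU halts
```

In this model the nested case never reaches the error log, which matches
the symptom in the original post: the console reports the halt directly
instead of the kernel logging an event.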
Uwe Siodlaczek
---------------------------------------------------------------------------
Original post:
Hello,
We have an AlphaStation 255/233 running Digital UNIX V4.0B (Rev. 564) WITH
Patch Kit-0008 ( Firmware revision: 6.9 PALcode: OSF version 1.46) which
halts with
 Halted CPU
 halt code = 7
 machine check while in PAL mode
 PC=19370.
The frequency with which this happens seems to be increasing.
Hardware/Software Error? 
Any hints? Thanks in advance.
Received on Fri Apr 09 1999 - 14:06:21 NZST