--
__/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/
Jeffrey G. Micono 505.844.6767
Ktech Corporation 505.268.3379
__/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/ __/
From: Kurt Carlson <sxkac_at_java.sois.alaska.edu>
>Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
>detected on cpu 0. Reporting suspended.
>Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
>Apr 8 15:32:44 net vmunix: fffd4
>Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
You have a hardware problem, probably memory or cpu.
Check your errorlog... with uerf you'll see CPU exceptions, with
decevent you may be able to tell exactly what.
>Could errant software cause this
no
>or should I be looking for loose chips, simms or cards?
reseating cards might help, might not.
you need to find out what's having the problem... check the
error logs.
_____________________________________________________________________
Kurt Carlson, University of Alaska SOIS/TS, (907)474-6266
sxkac_at_alaska.edu 910 Yukon Drive #105.63, Fairbanks, AK 99775-6200
From: Dave Cherkus <cherkus_at_homerun.unimaster.com>
Blue Moon Network Administrator writes:
|>
|> This has happened to me twice now.
|>
|> the machine has been running with no problems for months. Solid as a rock.
|>
|> Now I get this twice in a row within 60 minutes of each other:
|>
|> Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
|> detected on cpu 0. Reporting suspended.
|> Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
|> Apr 8 15:32:44 net vmunix: fffd4
|> Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
|>
|> Could errant software cause this or should I be looking for loose chips, simms
|> or cards? The last thing we did was hack wuftpd to not allow logins for users
|> with the shell /bin/nologin which is for a POP only type of account.
This is definitely a hardware problem.
|> The machine and cables haven't been moved at all in a while and nothing major
|> has changed since it was stable.
|>
|> I do have the messages file with all the addresses, but our service contract
|> with those thieves at DEC has expired and the dump info is just so much
|> gibberish.
Well, those thieves might also suggest you install a program called
DEC-Event that at least tries to decode the gibberish for you.
|> Do I have a definite hardware fault here or can software elicit such a crash?
|> Dust in the case of the 1000? Loose simms? Dirty card edges? Power is
|> conditioned through a UPS.
Hardware. The cpu can correct errors in the cache chips or the memory
simms. Try reseating the card that the CPU is on and memory chips.
Don't try reseating the CPU itself - you will almost certainly bend a
pin and ruin the CPU.
I hate to say it, but I have seen this pattern before and did have
to have the CPU card replaced.
--
Dave Cherkus ------- UniMaster, Inc. ------ Contract Software Development
Specialties: UNIX Internals/Kernel TCP/IP Alpha Clusters Performance ISDN
Email: cherkus_at_UniMaster.COM When the music's over, turn out the lights!
From: Olle Eriksson <olle_at_cb.uu.se>
You have a bad memory card or a bad cache memory.
From: "Knut Hellebø" <Knut.Hellebo_at_nho.hydro.com>
Regards,
Try shutting down to PROM mode and do 'set d_group field;memory' to test
the memory. If anything fails do 'showit' to get the status from the
memory test (and hopefully the failing SIMM(s)). To interrupt the test
you have to hard reset. Good Luck ;-)
--
******************************************************************
* Knut Hellebø | DAMN GOOD COFFEE !! *
* Norsk Hydro a.s | (and hot too) *
* Phone: +47 55 996870, Fax: +47 55 996342 | *
* Cellular Phone: +47 93092402 | *
* E-mail: Knut.Hellebo_at_nho.hydro.com | Dale Cooper, FBI *
******************************************************************
From: TetraPakDA_at_t-online.de (Tetra Pak APS GmbH)
Hi,
I had a similar problem on my Alpha Server.
Changing the simms solved it (we had DEC and third party simms).
In console mode, init the system; if it says something
about a correctable error, try exchanging the simms.
Best regards
Claudia
From: "Dr. Tom Blinn, 603-881-0646" <tpb_at_zk3.dec.com>
> Now I get this twice in a row within 60 minutes of each other:
>
> Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors
> detected on cpu 0. Reporting suspended.
> Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor
> Apr 8 15:32:44 net vmunix: fffd4
> Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error
>
> Could errant software cause this or should I be looking for loose chips, simms
> or cards? The last thing we did was hack wuftpd to not allow logins for users
> with the shell /bin/nologin which is for a POP only type of account.
I passed your message along to one of the engineers who works with the
platform you are seeing the problem on, and he pointed out some things
you might not be aware of.
One class of processor correctable errors is single-bit memory errors.
These are corrected by the ECC logic, but an event is logged through
the "Machine Check" logic (an interface between the PALcode, which runs
in interrupt mode to deal with hardware problems, and the operating
system). When there are large numbers of processor corrected errors
in a short period, you get the message about reporting being turned
off (so the error logs don't grow without bound, but you'll know we
stopped logging the errors).
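As a rough illustration of the "reporting suspended" behaviour described
above, here is a minimal Python sketch. It is not the kernel's code: the
class name, the threshold, and the time window are all hypothetical values
chosen for the example.

```python
from collections import deque


class CorrectedErrorReporter:
    """Log corrected errors until too many arrive within a short window."""

    def __init__(self, max_events=5, window_seconds=60.0):
        # Both limits are made-up illustration values, not the kernel's.
        self.max_events = max_events
        self.window_seconds = window_seconds
        self.recent = deque()   # timestamps of recently logged errors
        self.suspended = False
        self.log = []

    def report(self, timestamp, cpu):
        if self.suspended:
            return              # errors still occur, but are no longer logged
        self.recent.append(timestamp)
        # Forget events that have fallen out of the sliding window.
        while self.recent and timestamp - self.recent[0] > self.window_seconds:
            self.recent.popleft()
        if len(self.recent) > self.max_events:
            self.suspended = True
            self.log.append(
                f"WARNING: too many corrected errors on cpu {cpu}. "
                "Reporting suspended."
            )
        else:
            self.log.append(f"corrected error on cpu {cpu} at t={timestamp}")


reporter = CorrectedErrorReporter(max_events=3, window_seconds=10.0)
for t in [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]:
    reporter.report(t, cpu=0)
print("\n".join(reporter.log))
```

The point of the sketch is only that the warning is the last thing logged:
once it appears, silence in the log does not mean the errors have stopped.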
A double-bit error would NOT be corrected, and would panic the system,
perhaps with a "Machine check - Hardware error".
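The single-bit versus double-bit distinction can be seen in a toy
single-error-correct, double-error-detect (SECDED) code. The Python sketch
below uses a Hamming(7,4) code plus one overall parity bit on a 4-bit value.
This is an assumption made for illustration only: real memory ECC works on
much wider words, and nothing here is taken from the actual hardware design.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit word; index 0 is the overall parity."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    # Check bit p covers every position whose index has bit p set.
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, nibble); nibble is None if the error is uncorrectable."""
    word = list(word)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(word[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "clean"
    elif not overall_ok:
        # Odd number of flips: assume one. The syndrome is its position
        # (0 means the overall parity bit itself was hit).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Syndrome set but overall parity even: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble


good = encode(0b1011)
one_bit = list(good)
one_bit[5] ^= 1                 # single-bit error
two_bit = list(one_bit)
two_bit[6] ^= 1                 # second flipped bit in the same word
for label, w in (("good", good), ("one bit", one_bit), ("two bits", two_bit)):
    print(label, decode(w))
```

The single flipped bit is repaired and the data comes back intact, which
corresponds to the "corrected by processor" case. With two flipped bits the
code can only tell that something is wrong, which is the situation where the
system has no safe choice but to stop.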
You need to run UERF or DECevent (see the reference pages) against the
binary error log (which gets updated after the panic during the reboot
with the data from the hardware logout frame) and get the detailed log
of what made the system fail. (Unfortunately, on your system, there
is no UERF support, so you have to use DECevent.)
Once you have that information, it's possible to tell exactly what went
wrong.
If you have a hardware support contract, I'd recommend you call this in
and ask that your system be repaired, because something's broken in the
hardware. If you maintain the system yourself, then you need to do the analysis
of the error log yourself and decide what components to replace.
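Decoding the binary error log needs DECevent, but the text messages file
can at least give a timeline of what happened. The Python sketch below is
one hedged way to pull the corrected-error and panic lines out of it. The
regular expression is written against the lines quoted in this thread and
may need adjusting for other syslog formats; the function names are
invented for the example.

```python
import re
from collections import Counter

# Matches lines shaped like the ones quoted in this thread:
#   "Apr 8 15:28:21 net vmunix: <message>"
LINE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+vmunix:\s+(.*)$")


def classify(message):
    """Label a kernel message as 'panic', 'corrected', or None."""
    text = message.lower()
    if "panic" in text and "machine check" in text:
        return "panic"
    if "corrected" in text:
        return "corrected"
    return None


def scan(lines):
    """Return (timestamp, kind, message) for each machine-check related line."""
    events = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue
        kind = classify(m.group(3))
        if kind:
            events.append((m.group(1), kind, m.group(3)))
    return events


sample = [
    "Apr 8 15:28:21 net vmunix: WARNING: too many Processor corrected errors",
    "Apr 8 15:28:21 net vmunix: Machine Check error corrected by processor",
    "Apr 8 15:32:44 net vmunix: fffd4",
    "Apr 8 15:32:44 net vmunix: panic (cpu 0): Machine check - Hardware error",
]
events = scan(sample)
counts = Counter(kind for _, kind, _ in events)
for when, kind, message in events:
    print(when, kind, message)
print(dict(counts))
```

To run it against a real file, pass an open file object to scan() in place
of the sample list. A run of corrected errors shortly before a panic, as in
the sample, is the pattern the replies above describe.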
Tom
Dr. Thomas P. Blinn, UNIX Software Group, Digital Equipment Corporation
110 Spit Brook Road, MS ZKO3-2/U20 Nashua, New Hampshire 03062-2698
Technology Partnership Engineering Phone: (603) 881-0646
Internet: tpb_at_zk3.dec.com Digital's Easynet: alpha::tpb
ACM Member: tpblinn_at_acm.org PC_at_Home: tom_at_felines.mv.net
Worry kills more people than work because more people worry than work.
Keep your stick on the ice. -- Steve Smith ("Red Green")
My favorite palindrome is: Satan, oscillate my metallic sonatas.
-- Phil Agre, pagre_at_ucsd.edu
Opinions expressed herein are my own, and do not necessarily represent
those of my employer or anyone else, living or dead, real or imagined.
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Received on Fri Apr 11 1997 - 22:50:55 NZST