---
These are caused by fixable bit errors in memory. AlphaStations use ECC
(Error Correcting Code) memory, where a few extra bits in each memory
word are used to provide enough parity information to allow repair of
single-bit errors and detection of double-bit errors in a word of
memory.
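To make the "repair single-bit, detect double-bit" idea concrete, here is a toy sketch in Python using an extended Hamming code over 4 data bits. It is only an illustration under stated assumptions: the actual ECC code and word width used by AlphaStation memory are not described in this thread, and the names `encode` and `decode` are made up for the example.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword.

    Toy example only; real ECC memory protects much wider words.
    The codeword is a list of 8 bits: index 0 is an overall parity bit,
    indexes 1, 2 and 4 are Hamming parity bits, the rest carry data.
    """
    code = [0] * 8
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) & 1             # makes the whole word even parity
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is zero
    # for a clean codeword and equals the position of a single flipped bit.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    odd_parity = sum(code) & 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd overall parity: treat as one flipped bit and repair it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

Flipping any one bit of `encode(n)` should still decode to `n` with status `corrected`, which corresponds to the corrected-error messages discussed here; flipping two bits is reported as `uncorrectable` rather than silently returning wrong data.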
If these are very intermittent, then you probably have little to worry
about; the messages above indicate that a single-bit error was
successfully corrected so there was no data corruption. If these are
very common or the messages indicate uncorrected errors, you may want to
replace the memory in the machine. Unfortunately we haven't found out
how to trace ECC errors down to a particular memory module, so you may
need to replace all the memory or do several changeouts with different
sets of modules to eliminate a single bad module.
Steve VanDevender <stevev_at_hexadecimal.uoregon.edu>
---
Hi,
From the looks of it, an error occurred in your main memory but was
corrected when the dynamic cache was getting filled, i.e., data was
being buffered in it.
If this is an isolated incident, it could be nothing serious. However,
if it has happened before, it could be an indication that your memory
chips are going bad. I'll ask around and get you more specific info on
it.
Hope this helps.
Santosh
Santosh Krishnan x2815 <santosh_at_heplinux1.uta.edu>
---
Sounds like a single-bit memory error that the ECC (error-correcting)
memory fixed. This is generally a problem with physical memory. I've
seen these result from a misaligned memory SIMM and from broken SIMMs.
If you have hardware support, call DEC and feed them the info you gave
us.
Good Luck.
Ed Jones Internet email: EJONES16_at_ford.com
CAD/CAM/PIM Internal email: ejones16_at_cadcam.pms.ford.com
313-845-6068 B220 Suite 100 ALPHA
---
Hi,
This is a non-fatal memory error (radio-electrical perturbation, etc.)
detected and corrected by the CPU.
It is a "normal" error message unless you get a lot of them; in that
case, a bank of memory may be about to fail.
Patrice.
/############################################################################\
# Patrice LEGOUX                                                             #
#------------------------------------+---------------------------------------#
# ADP/Gsi Mini Services              | Decnet : GSUV09::LEGOUX               #
# 4 Rue Sentou                       +---------------------------------------#
# 92150 SURESNES                     | Tel : +33(0)14625.5054                #
# France                             | Fax : +33(0)147.72.04.99              #
#                                    +---------------------------------------#
# E-Mail : Patrice.Legoux_at_gsi.fr                                          #
# X400 : C=FR; ADMD=ATLAS; PRMD=GSI; O=GSI; S=LEGOUX; G=PATRICE              #
# Memo : LEGOUX                                                              #
\############################################################################/
---
It looks like you may have some memory that is going bad. If you have
a support contract, talk to DEC and see if you can get them to give you
a PAK for DECEvent (it should be free if you are on support). DECEvent
will let you trace failing memory down to the SIMM level.
I am not sure whether your error is coming from system memory or the L2
cache. In any event, DECEvent should help you trace it down. If you are
already using DECEvent (your error log looks pretty detailed for uerf),
look at the full listings; they should give you something like this:
----- snip ----- snip ----- snip ----- snip ----- snip -----
******************************** ENTRY 113 ********************************
Logging OS 2. Digital UNIX
System Architecture 2. Alpha
Event sequence number 5.
Timestamp of occurrence 01-OCT-1996 20:42:27
Host name ssdeng
System type register x0000000C AlphaServer 8x00
Number of CPUs (mpnum) x00000004
CPU logging event (mperr) x00000000
Event validity 1. O/S claims event is valid
Event severity 5. Low Priority
Entry type 100. CPU Machine Check Errors
CPU Minor class 4. 620 System Correctable Error
--TLaser 620 Corr Error--
Software Flags x00000001 TLSB Error Log Snapshot Packet Present
Active CPUs x0000000F
Hardware Rev x00000000
System Serial Number ni54600czq
Module Serial Number AY55025362
System Revision x00000000
MCHK Reason Mask x00000086
MCHK Frame Rev x00000000
EI STAT xFFFFFFF0C4FFFFFF
DATA SOURCE IS MEMORY OR SYSTEM
CORRECTABLE ECC ERROR
D-ref fill
EV5 Chip Rev 4
EI ADDRESS xFFFFFF003300CF7F
FILL SYNDROME x0000000000000029
Data Bit = 011
ISR x0000000100100000
Ext. HW interrupt at IPL20
Correctable ECC errors (IPL31)
AST requests 3 - 0
x0000000000000000
WHAMI x00 TLSB NODE ID 0.
CPU0
MISCR x55 B-Cache Size 4 Mbyte Bcache
Two Processors
TLSB RUN Signal
CPU0 Running console
TLDEV x51008014 Device Type Turbo-Laser Dual CPU, 4meg Bcache
Device Rev x00005100
TLBER x00440000 CORRECTABLE READ DATA ERROR
DATA SYNDROME 2
TLESR0 x00405400
TLESR1 x00400C0C
TLESR2 x00602900 ECC Syndrome 0 x00000000
ECC Syndrome 1 x00000029
CORRECTABLE READ ECC ERROR
Error Syndrome 0 x00 No Error
Error Syndrome 1 x29 Data Bit = 139
TLESR3 x00409090
Palcode Revision x0000000600000301
Palcode Rev: 3.1-1
*TLaser CPU Registers*
TLSB Node Number 0.
TLDEV x8014 Turbo-Laser Dual CPU, 4meg Bcache
TLBER x00440000 CORRECTABLE READ DATA ERROR
DATA SYNDROME 2
TLCNR x00000200
TLVID x00000010
TLESR0 x00405400
TLESR1 x00400C0C
TLESR2 x00602900 ECC Syndrome 0 x00000000
ECC Syndrome 1 x00000029
CORRECTABLE READ ECC ERROR
TLESR3 x00409090
TLEPAERR x00000000
MODCONFIG x00098AD4 Lockout Enable
Command Piping To EV5 Disabled
Bcache Size: 4 MB
Bcache Idle Cycles Before 11.
Max Command Queue Entries 2.
Max Bus Queue Entries 4.
TLEPMERR x00000000
TLEPDERR x00000000
TLEP Interrupt Mask 0 x000000FE IPL 14 Interrupt Enable
IPL 15 Interrupt Enable
IPL 16 Interrupt Enable
IPL 17 Interrupt Enable
Interprocessor Interrupt Enable
Interval Timer Interrupt Enable
CPU Halt Enable
TLEP Interrupt Summary 0 x00000040 Interval Timer Interrupt Outstanding
TLEP Interrupt Mask 1 x00000000
TLEP Interrupt Summary 1 x00000000
*TLaser CPU Registers*
TLSB Node Number 1.
TLDEV x8014 Turbo-Laser Dual CPU, 4meg Bcache
TLBER x00800000
TLCNR x00000210
TLVID x00000032
TLESR0 x00000303
TLESR1 x00000303
TLESR2 x00000303
TLESR3 x00000303
TLEPAERR x00000000
MODCONFIG x00098AD4 Lockout Enable
Command Piping To EV5 Disabled
Bcache Size: 4 MB
Bcache Idle Cycles Before 11.
Max Command Queue Entries 2.
Max Bus Queue Entries 4.
TLEPMERR x00000000
TLEPDERR x00000000
TLEP Interrupt Mask 0 x000000FE IPL 14 Interrupt Enable
IPL 15 Interrupt Enable
IPL 16 Interrupt Enable
IPL 17 Interrupt Enable
Interprocessor Interrupt Enable
Interval Timer Interrupt Enable
CPU Halt Enable
TLEP Interrupt Summary 0 x00000000
TLEP Interrupt Mask 1 x00000000
TLEP Interrupt Summary 1 x00000000
* TLaser Memory Regs *
TLSB Node Number 7.
TLDEV x5000 Turbo-Laser Memory Module
TLBER x01440000 CORRECTABLE READ DATA ERROR
DATA SYNDROME 2
DATA TRANSMITTER DURING ERROR
TLCNR x000FC270
TLVID x00000080
FADR x078200003300CF40
FADR 1 x07820000 Failing Command: Read
Failing Bank = Bank 8
TLESR0 x00005400
TLESR1 x00000C0C
TLESR2 x00212900 ECC Syndrome 0 x00000000
ECC Syndrome 1 x00000029
TRANSMITTER DURING ERROR
CORRECTABLE READ ECC ERROR
ECC Code x00
Second ECC Code x29 Failing SIMM Number = J17
TLESR3 x00009090
TMIR x80000001 Interleave x00000001
TMCR x0000023D 2GB Module (E2036-AA)
16 MB
70ns DRAM
Strings Installed = 8
DRAM timing: Bus Spd = 13.0-15.0;
Refresh Cnt = 1008
TMER x00000005 Failing String = x00000005
TMDRA x00000000 Refresh Rate 1X
TDDR0 x00000000
TDDR1 x00000000
TDDR2 x00000000
TDDR3 x00000000
* TLaser I/O Registers *
TLSB Node Number 8.
TLDEV x2020 Turbo-Laser Integrated I/O Module
TLBER x00000000
FADR 0 x0000000000000000
FADR 1 x00000000
TLESR0 x00000000
TLESR1 x00000000
TLESR2 x00000000
TLESR3 x00000000
CPU Interrupt Mask x00000001 Cpu Interrupt Mask = x00000001
ICCMSR x00000000 Arbitration Control Minimum Latency Mode
Supress Control Suppress after 16 Transations
ICCNSE x80000000 Interrupt Enable on NSES Set
ICCMTR x00000000
IDPNSE-0 x00000006 Hose Power OK
Hose Cable OK
IDPNSE-1 x00000006 Hose Power OK
Hose Cable OK
IDPNSE-2 x00000000
IDPNSE-3 x00000000
IDPVR x00000800
ICCWTR x00000000
TLMBPR x0000000000000000
IDPDR0 x20000000
IDPDR1 x20000000
IDPDR2 x00000000
IDPDR3 x00000000
----- snip ----- snip ----- snip ----- snip ----- snip -----
As you can see from the "TLaser Memory Regs" section, it will specify
the memory down to the module, bank, and SIMM. Some other things to keep
in mind are:
(1) There will be a certain noise level of ECC-recovered errors, even
    with perfectly good memory. The causes have to do with things like
    cosmic rays, naturally occurring isotopes in the memory losing
    neutrons at high speed, EMF, power spikes, and other perfectly
    normal reasons why bits may get twiddled when they shouldn't. That
    is why we pay the extra $$ for ECC RAM. I wouldn't get worried
    unless you start to see a lot of errors from one or two SIMMs.
(2) DECEvent (and uerf) seem to believe that these are not memory
    errors but CPU errors, so you must extract them as such. Here is
    the command line I used to pull the entry above out:
        dia -icpu -R -o full > /tmp/delme
    Please note I haven't said anything about using DECEvent's
    auto-diagnostic features. That's because it managed to miss a
    failing SIMM on our 8400 that had a couple of hundred errors logged
    on it. DEC is still trying to figure out why it didn't recognize it.
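Since the auto-diagnostics missed a SIMM with a couple of hundred logged errors, one option is to count the entries yourself. The following is a rough Python sketch, assuming the full report has been saved as text (as the command above does into /tmp/delme) and that each memory entry contains a "Failing SIMM Number = ..." field like the one in the listing. The function names and the default threshold of 10 are made up for illustration and are not part of DECEvent.

```python
import re
from collections import Counter

# Field label copied from the "TLaser Memory Regs" section of the listing.
SIMM_RE = re.compile(r"Failing SIMM Number\s*=\s*(\S+)")


def tally_failing_simms(report_text):
    """Count how many logged entries name each failing SIMM."""
    return Counter(SIMM_RE.findall(report_text))


def suspects(counts, threshold=10):
    """List SIMMs with at least `threshold` entries, worst first.

    The threshold is an arbitrary assumption; pick one that sits above
    the background noise level you normally see on your own system.
    """
    return [simm for simm, n in counts.most_common() if n >= threshold]
```

To use it, read the saved report into a string, pass it to `tally_failing_simms`, and look at which SIMM labels dominate the counts. A single SIMM far ahead of the rest is the pattern described in point (1).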
Hope this helps,
Tom
--
+--------------------------------+------------------------------+
| Tom Webster | "Funny, I've never seen it |
| SysAdmin MDA-SSD ISS-IS-HB-S&O | do THAT before...." |
| webster_at_ssdpdc.mdc.com | - Any user support person |
+--------------------------------+------------------------------+
| Unless clearly stated otherwise all opinions are my own. |
+---------------------------------------------------------------+
---
You've got a self-correcting parity error on one (or many) of your
memory chips. I suppose you got that output from uerf. It's not
dangerous, as these chips are equipped with single-bit error correction
hardware. This could worsen if you have 2 bad bits in the same memory
word; in that case, the processor will not be able to correct the error
and the system would (probably) crash.
If that kind of error persists, contact DEC and have them replace the
faulting memory board if it's under guarantee (those beasts are NOT
CHEAP).
We once filled the binary.errlog file with that kind of message. The
binary.errlog file grew to 72 Mb in 10 minutes! Result: file system
full, system crawling on its knees... We cleared the log and had no
problems for a month or so. We regularly checked the uerf log, and when
the problem resurfaced a month later we had DEC replace the memory
board.
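A runaway error log like this can be caught early by sampling its size now and then. Below is a small Python sketch of the arithmetic, assuming you supply the path to your own binary.errlog (the location is not given in this thread). The helper names are hypothetical.

```python
import os
import shutil


def growth_rate(size_then, size_now, elapsed_seconds):
    """Bytes per second the log grew between two size samples."""
    return (size_now - size_then) / elapsed_seconds


def seconds_until_full(free_bytes, rate):
    """Estimated seconds until the file system fills, or None if not growing."""
    if rate <= 0:
        return None
    return free_bytes / rate


def sample(log_path):
    """Return (current log size, free bytes on the log's file system)."""
    directory = os.path.dirname(log_path) or "."
    return os.path.getsize(log_path), shutil.disk_usage(directory).free
```

Taking two calls to `sample` a few minutes apart and feeding the sizes to `growth_rate` gives a figure to compare against the free space. Growth on the scale described above (tens of megabytes in ten minutes) is a strong hint that one error is being logged over and over.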
Also, uerf sometimes (incorrectly) reports that kind of error as a CPU
EXCEPTION due to a bug in its binary.errlog parsing. This should be
fixed in revisions later than 3.2D-1.
HTH
Guy Dallaire
dallaire_at_total.net
"God only knows if god exists"
---
Hi;
I received the exact same error on the exact same machine type.
DIGITAL said it was a correctable SIMM error and to watch for repeats,
and only then worry ("replace the SIMM").
I only see two in uerf.
Scot
Scot <scot_at_engrs.infi.net>
---
It means you've got a bad memory chip that the ECC circuits are
compensating for. You need to get the board or chip replaced.
regards,
Ross
--
Ross Alexander, ve6pdq -- (403) 675 6311 -- rwa_at_cs.athabascau.ca
---
It was a message about a fixed memory error. If you got one, it's
probably no problem. If you get tens a day, you should look into it.
ECC memory corrects single-bit failures and detects double-bit
failures, but cannot fix the latter. So if you get lots of these
messages you should probably replace your memory.
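The "one is fine, tens a day is not" rule of thumb can be checked mechanically. This is a rough Python sketch, assuming a text report whose entries each carry a "Timestamp of occurrence DD-MON-YYYY ..." line in the format shown in the listing earlier in this summary, and assuming the report has already been limited to corrected memory errors. The function names and the limit of 10 per day are illustrative choices, not anything uerf or DECEvent provides.

```python
import re
from collections import Counter
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Matches e.g. "Timestamp of occurrence 01-OCT-1996 20:42:27".
STAMP_RE = re.compile(r"Timestamp of occurrence\s+(\d{2})-([A-Z]{3})-(\d{4})")


def errors_per_day(report_text):
    """Count logged entries per calendar day."""
    counts = Counter()
    for day, month, year in STAMP_RE.findall(report_text):
        counts[date(int(year), MONTHS[month], int(day))] += 1
    return counts


def busy_days(counts, per_day_limit=10):
    """Return the days, in order, with at least `per_day_limit` entries."""
    return sorted(d for d, n in counts.items() if n >= per_day_limit)
```

An empty result from `busy_days` matches the "probably no problem" case; days that keep showing up are the cue to look into the memory.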
Harald Lundberg <hl_at_tekla.fi>;Tekla Oy,Koronakatu 1,FIN-02210,ESPOO,FINLAND
tel +358-{9-8879449work,9-8039489fax,9-8026752,19-2418013res,50-5578303mob}
---
--
Nicci Roth
OTA Limited Partnership
1 Manhattanville Rd.
Purchase, NY 10577
phone: 914/694 5800
fax: 914/694 5831
email: nicci_at_ox.com
If you can't find any other meaning in everything that's
happening, try to consider it as entertainment.
Received on Tue Dec 03 1996 - 19:55:13 NZDT