[Summary]: What do these messages mean from Nicci Roth on 1996-12-04 (tru64-unix-managers)

From: Nicci Roth <nicci_at_ox.com>
Date: Tue, 03 Dec 1996 12:21:50 -0500
Hi all,

Sorry about the delay in this summary but here's as much as I know.
The overall consensus is that I'm experiencing a main memory problem. In
short, if you see this error once in a while you're OK. However, I've
been receiving the message anywhere from 13 to 974 times a day, so I
called Digital Service and they're on the case.

By the way, you can use UERF to get a count of the errors and their type.
The Digital service person also told me about a tool called DEC Event,
which should translate all the messages to a readable format. I haven't
tried it yet but I think I'm going to test it out.

Thanks to all who replied, the responses are below:

Nicci

-----cut here

Hi !

Your main memory has experienced either a glitch (in which case
you will not see the message anymore) or permanent damage
to one of its bits (you will receive this message every time
the damaged memory location is referenced). From the physical
address and the bit, a field service specialist should (;-) be
able to pinpoint which memory board has the problem. Anyway,
the ECC (error correcting code) logic in the CPU is taking
care of the error (thanks to redundant chips on the memory board),
so your data is not compromised. But sooner or later you may want
to replace the damaged board.

Good luck & regards,
Miguel Fliguer
Buenos Aires, Argentina
m_fliguer_at_scomp1.sonda.cl
---
These are caused by fixable bit errors in memory.  AlphaStations use ECC
(Error Correcting Code) memory, where a few extra bits in each memory
word are used to provide enough parity information to allow repair of
single-bit errors and detection of double-bit errors in a word of
memory.
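
As a rough illustration of the "correct single-bit, detect double-bit"
idea described above, here is a toy sketch in Python. It protects only 4
data bits with an extended Hamming code. The names `encode` and `decode`
are made up for this example, and real memory hardware works on much
wider words with its own code, so this shows the principle only, not
what an AlphaStation actually does.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                      # index 0 holds the overall parity bit
    code[3], code[5], code[6], code[7] = d
    # Check bits sit at positions 1, 2 and 4 and cover the usual Hamming groups.
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2         # makes the total number of 1 bits even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos             # XOR of the positions of all set bits
    parity_bad = sum(code) % 2 == 1     # odd number of 1 bits overall
    if syndrome == 0 and not parity_bad:
        status = "ok"
    elif parity_bad:
        # Exactly one bit flipped; the syndrome is its position
        # (0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[5] ^= 1                            # simulate a single flipped bit
print(decode(word))
```

With one flipped bit the decoder reports the word as corrected and hands
back the original value; flip a second bit and it can only report the
word as uncorrectable, which is the case where data would be lost.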
If these are very intermittent, then you probably have little to worry
about; the messages above indicate that a single-bit error was
successfully corrected so there was no data corruption.  If these are
very common or the messages indicate uncorrected errors, you may want to
replace the memory in the machine.  Unfortunately we haven't found out
how to trace ECC errors down to a particular memory module, so you may
need to replace all the memory or do several changeouts with different
sets of modules to eliminate a single bad module.
Steve VanDevender <stevev_at_hexadecimal.uoregon.edu>
---
Hi,
From the looks of it, an error occurred in your main memory but was
corrected when the dynamic cache was getting filled, i.e., data was
being buffered in it.
If this is an isolated incident, it could be nothing serious.  However,
if it has happened before, it could be an indication that your memory
chips are going bad.  I'll ask around and get you more specific info on it.
Hope this helps.
Santosh
Santosh Krishnan x2815 <santosh_at_heplinux1.uta.edu>
---
Sounds like a single-bit memory error that the ECC (error correcting)
memory fixed. This is generally a problem with physical memory. I've seen
these result from a misaligned memory SIMM and from broken SIMMs. If you
have hardware support, call DEC and feed them the info you gave us.
Good Luck.
Ed Jones        Internet email:  EJONES16_at_ford.com
CAD/CAM/PIM     Internal email:  ejones16_at_cadcam.pms.ford.com
313-845-6068    B220 Suite 100 ALPHA
---
Hi,
        This is a non-fatal memory error (radio-electrical perturbation,
etc.) detected and corrected by the CPU.
        It is a "normal" error message unless you get a lot of them; in
that case, a bank of memory may be going bad.
Patrice.
/############################################################################\
# Patrice LEGOUX                                                             #
#------------------------------------+---------------------------------------#
# ADP/Gsi Mini Services              | Decnet : GSUV09::LEGOUX               #
# 4 Rue Sentou                       +---------------------------------------#
# 92150 SURESNES                     | Tel    : +33(0)14625.5054             #
# France                             | Fax    : +33(0)147.72.04.99           #
#                                    +---------------------------------------#
# E-Mail : Patrice.Legoux_at_gsi.fr                                          #
# X400   : C=FR; ADMD=ATLAS; PRMD=GSI; O=GSI; S=LEGOUX; G=PATRICE            #
# Memo   : LEGOUX                                                            #
\############################################################################/
---
It looks like you may have some memory that is going bad.  If you have
a support contract, talk to DEC and see if you can get them to give you
a PAK for DECEvent (it should be free if you are on support).  DECEvent
will let you trace failing memory down to the SIMM level.
I am not sure if your error is coming from system memory or the L2 cache.
In any event, DECEvent should help you trace it down.  If you are already
using DECEvent (your error log looks pretty detailed for uerf), look at
the full listings and it should give you something like this:
----- snip ----- snip ----- snip ----- snip ----- snip -----
******************************** ENTRY  113 ********************************
Logging OS                        2. Digital UNIX 
System Architecture               2. Alpha 
Event sequence number             5. 
Timestamp of occurrence              01-OCT-1996 20:42:27   
Host name                            ssdeng 
System type register      x0000000C  AlphaServer 8x00 
Number of CPUs (mpnum)    x00000004 
CPU logging event (mperr) x00000000 
Event validity                    1. O/S claims event is valid 
Event severity                    5. Low Priority 
Entry type                      100. CPU Machine Check Errors 
CPU Minor class                   4. 620 System Correctable Error 
--TLaser 620 Corr Error--              
Software Flags            x00000001  TLSB Error Log Snapshot Packet Present
Active CPUs               x0000000F 
Hardware Rev              x00000000 
System Serial Number                 ni54600czq 
Module Serial Number                 AY55025362 
System Revision           x00000000 
MCHK Reason Mask          x00000086 
MCHK Frame Rev            x00000000 
EI STAT                   xFFFFFFF0C4FFFFFF 
                                     DATA SOURCE IS MEMORY OR SYSTEM 
                                     CORRECTABLE ECC ERROR 
                                     D-ref fill 
                                     EV5 Chip Rev 4 
EI ADDRESS                xFFFFFF003300CF7F 
FILL SYNDROME             x0000000000000029 
                                     Data Bit = 011 
ISR                       x0000000100100000 
                                     Ext. HW interrupt at IPL20 
                                     Correctable ECC errors (IPL31) 
                                     AST requests 3 - 0  x0000000000000000
WHAMI                           x00  TLSB NODE ID  0. 
                                     CPU0 
MISCR                           x55  B-Cache Size  4 Mbyte Bcache 
                                     Two Processors 
                                     TLSB RUN Signal 
                                     CPU0 Running console 
TLDEV                     x51008014  Device Type  Turbo-Laser Dual CPU, 4meg
                                                  Bcache 
                                     Device Rev  x00005100 
TLBER                     x00440000  CORRECTABLE READ DATA ERROR 
                                     DATA SYNDROME 2 
TLESR0                    x00405400 
TLESR1                    x00400C0C 
TLESR2                    x00602900  ECC Syndrome 0  x00000000 
                                     ECC Syndrome 1  x00000029 
                                     CORRECTABLE READ ECC ERROR 
  Error Syndrome 0              x00  No Error 
  Error Syndrome 1              x29  Data Bit = 139 
TLESR3                    x00409090 
Palcode Revision          x0000000600000301 
                                     Palcode Rev: 3.1-1 
*TLaser CPU Registers*                 
TLSB Node Number                  0. 
TLDEV                         x8014  Turbo-Laser Dual CPU, 4meg Bcache 
TLBER                     x00440000  CORRECTABLE READ DATA ERROR 
                                     DATA SYNDROME 2 
TLCNR                     x00000200 
TLVID                     x00000010 
TLESR0                    x00405400 
TLESR1                    x00400C0C 
TLESR2                    x00602900  ECC Syndrome 0  x00000000 
                                     ECC Syndrome 1  x00000029 
                                     CORRECTABLE READ ECC ERROR 
TLESR3                    x00409090 
TLEPAERR                  x00000000 
MODCONFIG                 x00098AD4  Lockout Enable 
                                     Command Piping To EV5 Disabled 
                                     Bcache Size:   4 MB 
                                     Bcache Idle Cycles Before 11. 
                                     Max Command Queue Entries 2. 
                                     Max Bus Queue Entries   4. 
TLEPMERR                  x00000000 
TLEPDERR                  x00000000 
TLEP Interrupt Mask 0     x000000FE  IPL 14 Interrupt Enable 
                                     IPL 15 Interrupt Enable 
                                     IPL 16 Interrupt Enable 
                                     IPL 17 Interrupt Enable 
                                     Interprocessor Interrupt Enable 
                                     Interval Timer Interrupt Enable 
                                     CPU Halt Enable 
TLEP Interrupt Summary 0  x00000040  Interval Timer Interrupt Outstanding
TLEP Interrupt Mask 1     x00000000 
TLEP Interrupt Summary 1  x00000000 
*TLaser CPU Registers*                 
TLSB Node Number                  1. 
TLDEV                         x8014  Turbo-Laser Dual CPU, 4meg Bcache 
TLBER                     x00800000 
TLCNR                     x00000210 
TLVID                     x00000032 
TLESR0                    x00000303 
TLESR1                    x00000303 
TLESR2                    x00000303 
TLESR3                    x00000303 
TLEPAERR                  x00000000 
MODCONFIG                 x00098AD4  Lockout Enable 
                                     Command Piping To EV5 Disabled 
                                     Bcache Size:   4 MB 
                                     Bcache Idle Cycles Before 11. 
                                     Max Command Queue Entries 2. 
                                     Max Bus Queue Entries   4. 
TLEPMERR                  x00000000 
TLEPDERR                  x00000000 
TLEP Interrupt Mask 0     x000000FE  IPL 14 Interrupt Enable 
                                     IPL 15 Interrupt Enable 
                                     IPL 16 Interrupt Enable 
                                     IPL 17 Interrupt Enable 
                                     Interprocessor Interrupt Enable 
                                     Interval Timer Interrupt Enable 
                                     CPU Halt Enable 
TLEP Interrupt Summary 0  x00000000 
TLEP Interrupt Mask 1     x00000000 
TLEP Interrupt Summary 1  x00000000 
* TLaser Memory Regs *                 
TLSB Node Number                  7. 
TLDEV                         x5000  Turbo-Laser Memory Module 
TLBER                     x01440000  CORRECTABLE READ DATA ERROR 
                                     DATA SYNDROME 2 
                                     DATA TRANSMITTER DURING ERROR 
TLCNR                     x000FC270 
TLVID                     x00000080 
FADR                      x078200003300CF40 
FADR 1                    x07820000  Failing Command:    Read 
                                     Failing Bank =   Bank 8 
TLESR0                    x00005400 
TLESR1                    x00000C0C 
TLESR2                    x00212900  ECC Syndrome 0  x00000000 
                                     ECC Syndrome 1  x00000029 
                                     TRANSMITTER DURING ERROR 
                                     CORRECTABLE READ ECC ERROR 
  ECC Code                      x00 
  Second ECC Code               x29  Failing SIMM Number = J17 
TLESR3                    x00009090 
TMIR                      x80000001  Interleave  x00000001 
TMCR                      x0000023D  2GB Module (E2036-AA) 
                                     16 MB 
                                     70ns DRAM 
                                     Strings Installed =   8 
                                     DRAM timing:   Bus Spd = 13.0-15.0; 
                                                    Refresh Cnt = 1008 
TMER                      x00000005  Failing String =   x00000005 
TMDRA                     x00000000  Refresh Rate   1X 
TDDR0                     x00000000 
TDDR1                     x00000000 
TDDR2                     x00000000 
TDDR3                     x00000000 
* TLaser I/O Registers *               
TLSB Node Number                  8. 
TLDEV                         x2020  Turbo-Laser Integrated I/O Module 
TLBER                     x00000000 
FADR 0                    x0000000000000000 
FADR 1                    x00000000 
TLESR0                    x00000000 
TLESR1                    x00000000 
TLESR2                    x00000000 
TLESR3                    x00000000 
CPU Interrupt Mask        x00000001  Cpu Interrupt Mask =   x00000001 
ICCMSR                    x00000000  Arbitration Control  Minimum Latency Mode
                                     Supress Control  Suppress after 16 
                                                      Transations 
ICCNSE                    x80000000  Interrupt Enable on NSES Set 
ICCMTR                    x00000000 
IDPNSE-0                  x00000006  Hose Power OK 
                                     Hose Cable OK 
IDPNSE-1                  x00000006  Hose Power OK 
                                     Hose Cable OK 
IDPNSE-2                  x00000000 
IDPNSE-3                  x00000000 
IDPVR                     x00000800 
ICCWTR                    x00000000 
TLMBPR                    x0000000000000000 
IDPDR0                    x20000000 
IDPDR1                    x20000000 
IDPDR2                    x00000000 
IDPDR3                    x00000000 
----- snip ----- snip ----- snip ----- snip ----- snip -----
As you can see from the "TLaser Memory Regs" section, it will specify the
memory down to the module, bank, and SIMM.  Some other things to keep in
mind are:
(1) There will be a certain noise level of ECC recovered errors, even with
    perfectly good memory.  The causes have to do with things like cosmic
    rays, naturally occurring isotopes in the memory losing neutrons
    at high speed, EMF, power spikes, and other perfectly normal reasons why
    bits may get twiddled when they shouldn't.  That is why we pay the extra
    $$ for ECC RAM.  I wouldn't get worried unless you start to see a lot of
    errors from one or two SIMMs.
    
(2) DECEvent (and uerf) seem to believe that these are not memory errors
    but CPU errors, and you must extract them as such.  Here is the
    command line I used to pull the entry above out:
    
        dia -icpu -R -o full > /tmp/delme
    
    Please note I haven't said anything about using DECEvent's
    auto-diagnostic features.  That's because it managed to miss a failing
    SIMM on our 8400 that had a couple of hundred errors logged on it.  DEC
    is still trying to figure out why it didn't recognize it.
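
Since the advice above is to worry only when a lot of errors come from
one or two SIMMs, a small script can do the counting from the saved full
listing. The following is a sketch only: it assumes the report text looks
like the sample entry above (lines containing "TLSB Node Number" and
"Failing SIMM Number = ..."), and the function name `count_failing_simms`
is made up for this example. Other systems or tool versions may word
these lines differently.

```python
import re
from collections import Counter

# Patterns modelled on the sample entry above (an assumption, not a
# documented report format).
NODE_RE = re.compile(r"TLSB Node Number\s+(\d+)\.")
SIMM_RE = re.compile(r"Failing SIMM Number\s*=\s*(\S+)")


def count_failing_simms(lines):
    """Count 'Failing SIMM Number' lines per (TLSB node, SIMM) pair."""
    counts = Counter()
    node = None                      # most recent TLSB node number seen
    for line in lines:
        m = NODE_RE.search(line)
        if m:
            node = int(m.group(1))
            continue
        m = SIMM_RE.search(line)
        if m:
            counts[(node, m.group(1))] += 1
    return counts
```

Usage would be something like `count_failing_simms(open("/tmp/delme"))`
followed by `.most_common()` on the result, which lists the worst
(node, SIMM) pairs first.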
    
    
Hope this helps,
Tom
--
+--------------------------------+------------------------------+
| Tom Webster                    | "Funny, I've never seen it   |
| SysAdmin MDA-SSD ISS-IS-HB-S&O | do THAT before...."          |
| webster_at_ssdpdc.mdc.com         | - Any user support person    |
+--------------------------------+------------------------------+
|   Unless clearly stated otherwise all opinions are my own.    |
+---------------------------------------------------------------+
---
You've got a self-correcting parity error on one (or many) of your memory
chips. I suppose you got that output from uerf.  It's not dangerous, as
these chips are equipped with single-bit error correction hardware. This
could worsen if you get 2 bad bits in a memory cell; in that case, the
processor will not be able to correct the error and the system would
(probably) crash.
If that kind of error persists, contact DEC and have them replace the
faulty memory board if it's under guarantee (those beasts are NOT CHEAP).
We once filled the binary.errlog file with that kind of message. The
binary.errlog file grew to 72 Mb in 10 minutes! Result: file system full,
system brought to its knees.... We cleared the log and had no problems for
a month or so. We regularly checked the uerf log, and when the problem
resurfaced a month later we had DEC replace the memory board.
Also, uerf sometimes (incorrectly) reports that kind of error as a CPU
EXCEPTION due to a bug in its binary.errlog parsing. This should be
fixed in revisions later than 3.2D-1.
                                        HTH
Guy Dallaire
dallaire_at_total.net
"God only knows if god exists"
---
Hi;
I received the exact same error on the exact same machine type.
DIGITAL said it was a correctable SIMM error and to watch for
repeats, then worry ("replace the SIMM").
I only see two in uerf.
  
  Scot
Scot <scot_at_engrs.infi.net>
---
It means you've got a bad memory chip that the ECC circuits are
compensating for.  You need to get the board or chip replaced.
regards,
Ross
-- 
Ross Alexander, ve6pdq  --  (403) 675 6311  --  rwa_at_cs.athabascau.ca
---
It was a message about a fixed memory error. If you got one, it's
probably no problem. If you get tens a day, you should look into it.
ECC memory corrects single-bit failures and detects double-bit failures
but cannot fix the latter. So if you get lots of these messages you
should probably replace your memory.
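
To put a number on "one" versus "tens a day", the entries in a full
listing can be tallied by date. This is a sketch under assumptions: it
relies on the "Timestamp of occurrence" line as it appears in the sample
entry quoted earlier in this summary, it assumes the listing has already
been narrowed to the corrected-memory-error entries, and the names
`errors_per_day` and `busy_days` as well as the cut-off of 10 are
invented for illustration.

```python
import re
from collections import Counter

# Timestamp line as seen in the sample entry in this thread; the exact
# wording is an assumption for other systems.
STAMP_RE = re.compile(r"Timestamp of occurrence\s+(\d{2}-[A-Z]{3}-\d{4})")


def errors_per_day(lines):
    """Count logged entries per calendar day (keyed by e.g. '01-OCT-1996')."""
    matches = (STAMP_RE.search(line) for line in lines)
    return Counter(m.group(1) for m in matches if m)


def busy_days(counts, threshold=10):
    """Return (day, count) pairs at or above threshold, busiest first.

    The default threshold is a hypothetical cut-off, not a vendor figure.
    """
    return [(day, n) for day, n in counts.most_common() if n >= threshold]
```

Feeding it the lines of a saved report and printing `busy_days(...)`
gives a quick list of the days worth mentioning to field service.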
Harald Lundberg <hl_at_tekla.fi>; Tekla Oy, Koronakatu 1, FIN-02210, ESPOO, FINLAND
tel +358-{9-8879449work,9-8039489fax,9-8026752,19-2418013res,50-5578303mob}
---
-- 
Nicci Roth
OTA Limited Partnership
1 Manhattanville Rd.
Purchase, NY 10577
phone: 914/694 5800
fax: 914/694 5831
email: nicci_at_ox.com
If you can't find any other meaning in everything that's 
happening, try to consider it as entertainment.
Received on Tue Dec 03 1996 - 19:55:13 NZDT