Mcelog – Bad memory?

The original hardware did all kinds of strange things: SSH not working, RPM checksums failing, random reboots, etc. I ended up swapping the chassis, mainboard and RAM but kept the HDDs to avoid an OS reinstall. Now, about 2 months after the swap, I’ve gotten this twice on the new box, followed by a hard freeze:

root@host [~]# mcelog --cpu nehalem --ascii < mce.txt
CPU 2: Machine Check Exception:     4 Bank 8: fe0000400001009f
TSC 38f57147ce2b8 ADDR 74a0f000 MISC c4e377c200041180

HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 TSC 38f57147ce2b8
MISC c4e377c200041180 ADDR 74a0f000
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 1
Memory ECC syndrome: 44e377c2
STATUS fe0000400001009f MCGSTATUS 4
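
For reference, the flags mcelog prints can be recovered by hand from the raw STATUS value. The following is a minimal Python sketch, not mcelog's own code: the names decode_status and decode_memory_code are made up for illustration, and the bit positions assume the architectural IA32_MCi_STATUS layout and the memory controller error code format (0000 0000 1MMM CCCC) described in the Intel SDM.

```python
# Minimal sketch: decode a raw MCi_STATUS value by hand.
# Assumption: architectural IA32_MCi_STATUS layout from the Intel SDM.
# Function names are hypothetical, not part of mcelog.

STATUS_FLAGS = [
    (63, "VAL: register valid"),
    (62, "OVER: error overflow"),
    (61, "UC: uncorrected error"),
    (60, "EN: error enabled"),
    (59, "MISCV: MCi_MISC register valid"),
    (58, "ADDRV: MCi_ADDR register valid"),
    (57, "PCC: processor context corrupt"),
]

# MMM field of a memory controller error code.
MEMORY_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_status(status):
    """Return (set flags, MCA error code, corrected error count)."""
    flags = [name for bit, name in STATUS_FLAGS if (status >> bit) & 1]
    mca_code = status & 0xFFFF              # bits 15:0
    corrected = (status >> 38) & 0x7FFF     # bits 52:38
    return flags, mca_code, corrected


def decode_memory_code(mca_code):
    """Decode a memory controller error code, or return None if it is not one."""
    # Bit 7 must be set; bits 8-11 and 13-15 must be clear (bit 12 is ignored).
    if (mca_code & 0xEF80) != 0x0080:
        return None
    operation = MEMORY_OPS.get((mca_code >> 4) & 0x7, "reserved")
    channel = mca_code & 0xF
    # Channel value 0xF means the channel is not specified.
    return operation, (None if channel == 0xF else channel)


if __name__ == "__main__":
    flags, code, count = decode_status(0xFE0000400001009F)
    for flag in flags:
        print(flag)
    print("MCA code: 0x%04x" % code, decode_memory_code(code))
    print("Corrected error count:", count)
```

Fed the STATUS value above, this reports all seven flags as set, an MCA code of 0x009f that decodes to a memory controller read on an unspecified channel, and a corrected error count of 1, which agrees with what mcelog printed.
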

I realized I was running the same kernel as before the chassis swap, so I wonder if the ‘screwiness’ of the old hardware left a corrupted kernel module or two? I know it’s a long shot… but so are two identical, brand new machines both having serious memory lapses. So I just upgraded the kernel (and the BIOS for good measure).

Fingers crossed…

Edit 4/20/11 – Turns out it was a hardware issue after all… did a full chassis swap yesterday. Haven’t cracked open the bad server yet to see what went wrong.

One Response to Mcelog – Bad memory?

  1. It was a hint: “HARDWARE ERROR. This is *NOT* a software problem!” 🙂 It’s better to trust the mcelog records and save yourself some time.