The original hardware did all kinds of strange things: SSH not working, RPM checksums failing, random reboots, etc. I ended up swapping the chassis, mainboard and RAM but kept the HDDs to avoid an OS reinstall. Now, about 2 months after the swap, I’ve gotten this twice on the new box, followed by a hard freeze:
root@host [~]# mcelog --cpu nehalem --ascii < mce.txt
CPU 2: Machine Check Exception: 4 Bank 8: fe0000400001009f
TSC 38f57147ce2b8 ADDR 74a0f000 MISC c4e377c200041180
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 2 BANK 8 TSC 38f57147ce2b8
MISC c4e377c200041180 ADDR 74a0f000
MCG status:MCIP
MCi status:
Error overflow
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: MEMORY CONTROLLER RD_CHANNELunspecified_ERR
Transaction: Memory read error
Memory read ECC error
Memory corrected error count (CORE_ERR_CNT): 1
Memory transaction Tracker ID (RTId): 0
Memory DIMM ID of error: 0
Memory channel ID of error: 1
Memory ECC syndrome: 44e377c2
STATUS fe0000400001009f MCGSTATUS 4
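The flag lines mcelog prints can be cross-checked by hand against the raw STATUS value. Here is a minimal Python sketch, assuming the architectural IA32_MCi_STATUS bit layout described in Intel's SDM. The function name `decode_mci_status` is made up for illustration, and the model-specific fields (ECC syndrome, DIMM and channel IDs from MISC) are not decoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
FLAG_BITS = [
    (63, "VAL"),    # register contains a valid error
    (62, "OVER"),   # error overflow: another error arrived before this one was cleared
    (61, "UC"),     # uncorrected error
    (60, "EN"),     # error reporting enabled
    (59, "MISCV"),  # MCi_MISC register valid
    (58, "ADDRV"),  # MCi_ADDR register valid
    (57, "PCC"),    # processor context corrupt
]

# MMM field of a memory controller error code (0000 0000 1MMM CCCC).
MEM_TRANSACTIONS = {
    0: "generic undefined request",
    1: "memory read error",
    2: "memory write error",
    3: "address/command error",
    4: "memory scrubbing error",
}


def decode_mci_status(status):
    """Decode the architectural part of a 64-bit MCi_STATUS value (hypothetical helper)."""
    flags = [name for bit, name in FLAG_BITS if (status >> bit) & 1]
    mca_code = status & 0xFFFF  # low 16 bits hold the MCA error code
    result = {"flags": flags, "mca_code": mca_code, "memory_controller": None}

    # Memory controller errors have bit 7 set and bits 15:8 clear.
    if (mca_code & 0xFF80) == 0x0080:
        channel = mca_code & 0xF
        result["memory_controller"] = {
            "transaction": MEM_TRANSACTIONS.get((mca_code >> 4) & 0x7, "reserved"),
            # CCCC == 1111 means the channel is not specified in the error code.
            "channel": None if channel == 0xF else channel,
        }
    return result


if __name__ == "__main__":
    print(decode_mci_status(0xFE0000400001009F))
```

For fe0000400001009f this gives the flags VAL, OVER, UC, EN, MISCV, ADDRV and PCC, plus a memory read error with no channel in the error code, which lines up with what mcelog reported above.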
I realized I was running the same kernel as before the chassis swap, so I wonder if the ‘screwiness’ of the old hardware left a corrupted kernel module or two? I know it’s a long shot… but so are two identical, brand new machines having serious memory lapses. So I just upgraded the kernel (and the BIOS for good measure).
Edit 4/20/11 – Turns out it was a hardware issue after all… did a full chassis swap yesterday. Haven’t cracked open the bad server yet to see what went wrong.