:hacker_f::hacker_s::hacker_e:
:hacker_d::hacker_o::hacker_w::hacker_n::hacker_t::hacker_i::hacker_m::hacker_e:
Box tanked again, and I didn't notice because I was afk (crazy bullshit and then a nap). It looks like a bad RAM stick: the MCE logs show uncorrected errors around the same address. (The IPMI logs dutifully record when the power was cut but say nothing about the crashes.) I'll be replacing the stick, but it might be a bumpy ride until then.
So here's what I've done:
· Tweaked panic reboot settings
· Set up a quick userspace watchdog (kernel came with sample code, I tweaked it, attached), just in case that somehow helps. (I don't expect userspace survives these crashes: syslog has no record of them, so the kernel is probably panicking.)
· Wrote a local watchdog to annoy me with noises and emails when it happens.
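For anyone curious what the userspace watchdog boils down to, here's a rough sketch in Python. It is not the attached code (that one is the tweaked kernel sample); it just shows the general idea, assuming the standard Linux watchdog device at /dev/watchdog. The function name pet_watchdog and its parameters are made up for illustration.

```python
import time


def pet_watchdog(device="/dev/watchdog", interval=10.0, beats=None, sleep=time.sleep):
    """Keep a Linux watchdog device from firing by writing to it regularly.

    device, interval, beats and sleep are illustrative parameters, not part
    of any kernel API. beats=None loops forever; a number stops after that
    many keepalive writes and then disarms.
    """
    count = 0
    # Unbuffered binary mode so each keepalive reaches the driver right away.
    with open(device, "wb", buffering=0) as wd:
        while beats is None or count < beats:
            wd.write(b"\0")  # any write counts as a keepalive
            count += 1
            sleep(interval)
        # "Magic close": writing 'V' before closing asks the driver to disarm,
        # if the driver supports it and nowayout is not set.
        wd.write(b"V")
    return count
```

Run as root, pet_watchdog() with the defaults loops forever. If the process stops writing (because the box has hung), the watchdog hardware should reset the machine once its timeout expires. If the kernel panics outright, the panic reboot settings are what matter instead.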
What I'm doing now:
· Going to see about marking those regions of memory bad at boot until the RAM is replaced, which might fix it the next time the box boots. Maybe this can even be done at runtime.
· frek evin
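On marking the bad regions at boot: one way that might work is the kernel's memmap=size$start boot parameter, which marks a physical range as reserved. Here's a small sketch that turns a list of bad physical addresses into page-aligned memmap= arguments with some padding on either side. The helper name memmap_args, the padding default, and the example address are all made up; the real addresses would come from the MCE logs.

```python
PAGE = 4096  # assuming 4 KiB pages


def memmap_args(bad_addrs, pad_pages=1, page=PAGE):
    """Build memmap=size$start arguments covering the given physical addresses.

    Each address is widened to its whole page plus pad_pages on either side;
    overlapping or touching ranges are merged.
    """
    regions = []
    for addr in sorted(bad_addrs):
        start = max(0, (addr // page - pad_pages) * page)
        end = (addr // page + pad_pages + 1) * page
        if regions and start <= regions[-1][1]:
            # Extend the previous region instead of emitting a new one.
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([start, end])
    return ["memmap=%#x$%#x" % (end - start, start) for start, end in regions]


# Hypothetical address, not one from the actual logs.
print(memmap_args([0x12345678]))
```

The resulting strings go on the kernel command line. As far as I know the $ needs escaping when it goes through the GRUB config, so check what the kernel actually received in /proc/cmdline after booting.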
https://www.youtube.com/watch?v=WN4fCK23Srk
Attachments: 2600--freedom_downtime.jpeg, watchdog