Soft errors caused by single-event upsets (SEUs), a.k.a. ECC RAM absolutely matters



The other day, we collected a few amusing and interesting things about bit flips caused by cosmic rays. Please rest assured that we use only ECC memory for our server machines.

DRAM quotes

  • “A bit flipping at random is not a problem solely related to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays. […] According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of memory of non-ECC memory is likely to incur to several memory errors every year.”
    Redis Crashes - <antirez>

  • “Currently the probable primary source of soft errors in DRAM is electrical disturbance caused by terrestrial cosmic rays, which are very high-energy subatomic particles originating in outer space.”

  • “A system on Earth, at sea level, with 4 GB of RAM has a 96% percent chance of having a bit error in three days without ECC RAM. With ECC RAM, that goes down to 1.67e-10 or about one chance in six billions.”
    On the need to use error-correcting memory

  • “Based on sea-level bit upset probabilities given in Eugene Normand’s SEU at ground level paper, I computed that if you have 4 GiB of memory, you have 96% chance of getting a bit flip in three days because of cosmic rays. SECDED ECC would reduce that to a negligible one chance in six billion.”

  • DRAM Errors in the Wild: A Large-Scale Field Study

  • “It’s a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.
    For me, bitflips due to cosmic rays are one of those problems I always assumed happen to “other people”. I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I’d never really be able to tell the difference from some random kernel bug.
    A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn’t make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can’t prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.”

  • “A two-and-a-half year study of DRAM on 10s of thousands Google servers found DIMM error rates are hundreds to thousands of times higher than thought – a mean of 3,751 correctable errors per DIMM per year. This is the world’s first large-scale study of RAM errors in the field.”

  • “Some of the numbers are really terrifying: 4.15% unrecoverable errors for of the platforms are much more then i had thought and I’m somewhat conservative in my thinking how far i trust hardware. Furthermore hard errors (as in “bit permanently flipped and put it to the trashbin”) are vastly more common reasons for errors as most people think.”
    Observations on memory reliability

  • Amazon S3 Availability Event: July 20, 2008
    For example, in 2008, Amazon S3 was brought down for several hours when a single-bit hardware error propagated through the system.

  • Real Programmers

  • For a brief period, the Windows kernel tried to deal with gamma rays corrupting the processor cache
    For a brief period, the Windows kernel tried to deal with gamma rays | Hacker News
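The "96% chance of a bit flip in three days" figure quoted above can be turned into an approximate error rate. The sketch below assumes that bit flips arrive independently at a constant rate (a Poisson process); that model is our assumption and not necessarily how the quoted author computed the number, and the function names are made up for illustration.

```python
import math

def implied_rate_per_day(p_at_least_one: float, days: float) -> float:
    """Rate (events per day) that gives probability p of >= 1 event within `days`.

    Assumes a Poisson process: P(at least one) = 1 - exp(-rate * days).
    """
    return -math.log(1.0 - p_at_least_one) / days

def p_at_least_one(rate_per_day: float, days: float) -> float:
    """Probability of seeing at least one event within `days` at the given rate."""
    return 1.0 - math.exp(-rate_per_day * days)

# Quoted claim: 96% chance of a bit flip within three days for 4 GiB without ECC.
rate = implied_rate_per_day(0.96, 3)   # flips per day for the whole 4 GiB
p_one_day = p_at_least_one(rate, 1)    # chance of at least one flip in a single day
p_one_week = p_at_least_one(rate, 7)   # chance of at least one flip in a week

print(f"implied rate: {rate:.2f} flips/day")
print(f"P(flip within 1 day):  {p_one_day:.1%}")
print(f"P(flip within 7 days): {p_one_week:.2%}")
```

Under this assumption the claim works out to roughly 1.07 flips per day for 4 GiB, which is in line with the "one error per 4GB of RAM per day" estimate quoted in another excerpt above.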

General information

DRAM off-topic

ECC memory does not protect against everything.
Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks
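To illustrate why SECDED-style ECC has limits, here is a toy sketch of an extended Hamming (8,4) code: it corrects one flipped bit, detects two, and can silently miscorrect three. Real ECC DIMMs typically protect 64 data bits with 8 check bits and their exact codes differ between vendors, so treat this purely as an illustration; the function names and bit layout are made up for the example.

```python
def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the classic
    Hamming(7,4) positions (parity bits at 1, 2, 4; data at 3, 5, 6, 7).
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code

def decode(word: list[int]) -> tuple[str, int]:
    """Return (status, data nibble) for a possibly corrupted codeword."""
    code = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    parity_error = sum(code) % 2 == 1
    if syndrome == 0 and not parity_error:
        status = "ok"
    elif parity_error:
        # Odd number of flips: the decoder assumes exactly one and "fixes" it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome but even parity: two flips, detectable only.
        status = "uncorrectable"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble

def flip(word: list[int], *positions: int) -> list[int]:
    """Return a copy of `word` with the given bit positions inverted."""
    out = list(word)
    for pos in positions:
        out[pos] ^= 1
    return out

word = encode(0b1010)
one_flip = decode(flip(word, 5))           # single error: repaired
two_flips = decode(flip(word, 5, 6))       # double error: flagged
three_flips = decode(flip(word, 3, 5, 6))  # triple error: "corrected" to wrong data
```

With three flips the decoder reports a successful correction but hands back the wrong data, which is the kind of gap that makes ECC an obstacle to Rowhammer rather than a guarantee against it.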

SRAM research

P.S.: In its infancy, this article was called Re: Memory errors / ECC memory vs. cosmic rays.

How to Design Highly Reliable Digital Electronics

I went ahead and showed him that, before @weef feels called into action here :P


ECC absolutely matters.

Linus Torvalds also recommends using ECC RAM and shares his insights into the ECC industry in this nice rant.

Thanks for sharing, Fabrizio!