Soft errors caused by single-event upsets (SEUs), a.k.a. ECC RAM absolutely matters

About

Introduction

The other day, we collected a few amusing and interesting things about bit flips caused by cosmic rays. Rest assured, we use only ECC memory in our server machines.

DRAM quotes

  • “A bit flipping at random is not a problem solely related to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays. […] According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of memory of non-ECC memory is likely to incur to several memory errors every year.”

    Redis Crashes - <antirez>

  • “Currently the probable primary source of soft errors in DRAM is electrical disturbance caused by terrestrial cosmic rays, which are very high-energy subatomic particles originating in outer space.”

    https://www.cs.princeton.edu/~appel/papers/memerr.pdf
    – via: dinaburg.org

  • “A system on Earth, at sea level, with 4 GB of RAM has a 96% percent chance of having a bit error in three days without ECC RAM. With ECC RAM, that goes down to 1.67e-10 or about one chance in six billions.”

    http://lambda-diode.com/opinion/ecc-memory

  • “Based on sea-level bit upset probabilities given in Eugene Normand’s SEU at ground level paper, I computed that if you have 4 GiB of memory, you have 96% chance of getting a bit flip in three days because of cosmic rays. SECDED ECC would reduce that to a negligible one chance in six billion.”

    http://lambda-diode.com/opinion/ecc-memory-2
    http://web.archive.org/web/20090226195204/http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf

  • DRAM Errors in the Wild: A Large-Scale Field Study

    Abstract
    Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.

    DRAM Errors in the Wild: A Large-Scale Field Study – Google Research
    http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

  • "It’s a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

    For me, bitflips due to cosmic rays are one of those problems I always assumed happen to “other people”. I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I’d never really be able to tell the difference from some random kernel bug.

    A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn’t make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can’t prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative."

    https://blogs.oracle.com/linux/post/attack-of-the-cosmic-rays

  • “A two-and-a-half year study of DRAM on 10s of thousands Google servers found DIMM error rates are hundreds to thousands of times higher than thought – a mean of 3,751 correctable errors per DIMM per year. This is the world’s first large-scale study of RAM errors in the field.”

    Nightmare on DIMM street | StorageMojo
    DRAM error rates: Nightmare on DIMM street | ZDNet

  • “Some of the numbers are really terrifying: 4.15% unrecoverable errors for of the platforms are much more then i had thought and I’m somewhat conservative in my thinking how far i trust hardware. Furthermore hard errors (as in “bit permanently flipped and put it to the trashbin”) are vastly more common reasons for errors as most people think.”

    Observations on memory reliability - c0t0d0s0.org

  • Amazon S3 Availability Event: July 20, 2008

    For example, in 2008, Amazon S3 was brought down for several hours when a single-bit hardware error propagated through the system.

    http://status.aws.amazon.com/s3-20080720.html

  • Real Programmers
    http://xkcd.com/378/

  • For a brief period, the Windows kernel tried to deal with gamma rays corrupting the processor cache

    For a brief period, the kernel tried to deal with gamma rays corrupting the processor cache – The Old New Thing
    – via: For a brief period, the Windows kernel tried to deal with gamma rays | Hacker News
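
As a rough cross-check of the figures quoted above, here is a minimal sketch. It assumes that bit flips arrive independently at a constant rate (a Poisson model; that assumption is ours and is not taken from the cited sources), and the function names are made up for illustration. Under that assumption, the "96% in three days for 4 GiB" estimate corresponds to roughly one expected flip per day, which is in line with the "one error per 4GB of RAM per day" estimate quoted from the Oracle post.

```python
import math


def expected_flips(p_at_least_one: float) -> float:
    """Expected number of flips in a time window, given the probability
    of seeing at least one flip in that window (Poisson assumption)."""
    return -math.log(1.0 - p_at_least_one)


def p_at_least_one(rate_per_day: float, days: float) -> float:
    """Probability of at least one flip within `days` at a constant rate."""
    return 1.0 - math.exp(-rate_per_day * days)


# Quoted estimate: 96% chance of a bit flip within three days for 4 GiB.
flips_in_three_days = expected_flips(0.96)
rate_per_day = flips_in_three_days / 3.0

print(f"expected flips per day for 4 GiB: {rate_per_day:.2f}")
print(f"chance of at least one flip in one day: {p_at_least_one(rate_per_day, 1):.1%}")
print(f"chance of at least one flip in 30 days: {p_at_least_one(rate_per_day, 30):.6f}")
```

The implied rate comes out at about 1.07 expected flips per day for 4 GiB. These are back-of-the-envelope numbers derived from the quoted estimates only; the field study cited above reports measured rates that vary widely between machines.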

General information

DRAM off-topic

ECC memory does not protect against everything.
Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks
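
To illustrate the general limitation, here is a toy SECDED (single error correction, double error detection) code: an extended Hamming code over 4 data bits. This is a sketch under stated assumptions, not how any particular memory controller works; real ECC DIMMs typically protect 64 data bits with 8 check bits, and the names `encode` and `decode` are hypothetical. It is also not the attack from the paper above. It only shows that one flipped bit is corrected, two are detected, and three can be silently "corrected" into wrong data.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 form Hamming(7,4)
    with check bits at positions 1, 2, 4 and data at 3, 5, 6, 7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(word: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, data), where status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(word)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    parity_odd = sum(code) % 2 == 1  # overall parity violated?

    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        # Odd number of flips: the decoder assumes exactly one and fixes it.
        # A syndrome of 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a non-zero syndrome: detected, not fixable.
        return "uncorrectable", None

    data = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, data


def flip(word: List[int], *positions: int) -> List[int]:
    """Return a copy of `word` with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
print(decode(flip(word, 5)))        # one flip
print(decode(flip(word, 3, 6)))     # two flips
print(decode(flip(word, 3, 5, 7)))  # three flips
```

In the last case the decoder reports "corrected" but returns data that differs from what was stored, because three flips look like a single flip at another position.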

SRAM research


P.S.: In its infancy, this article was called Re: Memory errors / ECC memory vs. cosmic rays.

How to Design Highly Reliable Digital Electronics

I went ahead and showed it to him, before @weef feels called onto the scene here :P

ECC absolutely matters

Linus Torvalds also recommends using ECC RAM and shares his insights into the ECC industry in this nice rant.

Thanks for sharing, Fabrizio!

In this context, I wanted to share with you an entertaining introduction to cosmic rays by astrophysicist Matthew John O’Dowd.

Peter Gutmann also has some slides about the topic at Software Security in the Presence of Faults.

Tim McNamara offers a software-only fault-resistance solution for safe Booleans on GitHub: timClicks/coin-boolean, a bit flip resistant Boolean type.

– via: https://news.ycombinator.com/item?id=34126239
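
One way to picture a bit-flip-resistant Boolean is the minimal sketch below. It is not the implementation used by coin-boolean; the 8-bit patterns, the majority-vote rule, and the names `store` and `load` are assumptions made for illustration. The idea is to spread one logical bit over a whole byte so that a few flipped bits cannot change the decoded value.

```python
TRUE_PATTERN = 0xFF   # hypothetical encoding: all eight bits set
FALSE_PATTERN = 0x00  # hypothetical encoding: all eight bits clear


def store(value: bool) -> int:
    """Encode a Boolean as a redundant 8-bit pattern."""
    return TRUE_PATTERN if value else FALSE_PATTERN


def load(stored: int) -> bool:
    """Decode by majority vote over the eight bits.

    Up to three flipped bits are tolerated; an exact 4/4 tie is
    ambiguous and is read as False in this sketch.
    """
    ones = bin(stored & 0xFF).count("1")
    return ones > 4


flag = store(True)
damaged = flag ^ 0b00010100  # simulate two bit flips in memory
print(load(flag), load(damaged))
```

A plain one-bit Boolean has no such margin: a single flip turns True into False. The price of the redundancy is a popcount on every read.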

Another popular application of this feature. Thanks for sharing, Tim!