About
Introduction
The other day, we collected a few amusing and interesting things about bit flips caused by cosmic rays. Rest assured that we use only ECC memory in our server machines.
DRAM quotes
-
“A bit flipping at random is not a problem solely related to broken memory. Perfectly healthy memory is also subject, with a small probability, to bit flipping because of cosmic rays. […] According to a few sources, including IBM, Intel and Corsair, a computer with a few GB of memory of non-ECC memory is likely to incur to several memory errors every year.”
-
“Currently the probable primary source of soft errors in DRAM is electrical disturbance caused by terrestrial cosmic rays, which are very high-energy subatomic particles originating in outer space.”
– https://www.cs.princeton.edu/~appel/papers/memerr.pdf
– via: dinaburg.org
-
“A system on Earth, at sea level, with 4 GB of RAM has a 96% percent chance of having a bit error in three days without ECC RAM. With ECC RAM, that goes down to 1.67e-10 or about one chance in six billions.”
-
“Based on sea-level bit upset probabilities given in Eugene Normand’s SEU at ground level paper, I computed that if you have 4 GiB of memory, you have 96% chance of getting a bit flip in three days because of cosmic rays. SECDED ECC would reduce that to a negligible one chance in six billion.”
– http://lambda-diode.com/opinion/ecc-memory-2
– http://web.archive.org/web/20090226195204/http://www.boeing.com/assocproducts/radiationlab/publications/SEU_at_Ground_Level.pdf
-
DRAM Errors in the Wild: A Large-Scale Field Study
Abstract
Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days.
– DRAM Errors in the Wild: A Large-Scale Field Study – Google Research
– http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
-
"It’s a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.
For me, bitflips due to cosmic rays are one of those problems I always assumed happen to “other people”. I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I’d never really be able to tell the difference from some random kernel bug.
A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn’t make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can’t prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative."
– https://blogs.oracle.com/linux/post/attack-of-the-cosmic-rays
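To make "extra bits to store error-correcting codes" a little more concrete, here is a toy SECDED (single error correction, double error detection) code in Python. It is only a sketch of an extended Hamming(8,4) code protecting four data bits; real ECC DIMMs typically protect 64 data bits with 8 check bits, and the function names encode and decode are ours for illustration, not taken from any library or from the sources quoted here.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    code = [0] * 8  # index 0: overall parity; indexes 1..7: Hamming(7,4)
    # Data bits live at the positions that are not powers of two.
    for pos, shift in zip((3, 5, 6, 7), range(4)):
        code[pos] = (nibble >> shift) & 1
    # Each check bit p makes the parity of all positions containing p even.
    for p in (1, 2, 4):
        code[p] = sum(code[i] for i in range(1, 8) if i & p) % 2
    # The extra overall parity bit is what turns SEC into SECDED.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double-error'."""
    code = list(code)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(code[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(code) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: treat it as a single flip. The syndrome is the
        # index of the flipped bit (0 means the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return None, "double-error"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                # one flipped bit
print(decode(word))         # (11, 'corrected')
word[2] ^= 1                # a second flipped bit in the same word
print(decode(word))         # (None, 'double-error')
```

A single flip anywhere in the codeword is repaired silently, while two flips in the same word can only be reported, not repaired.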
-
“A two-and-a-half year study of DRAM on 10s of thousands Google servers found DIMM error rates are hundreds to thousands of times higher than thought – a mean of 3,751 correctable errors per DIMM per year. This is the world’s first large-scale study of RAM errors in the field.”
– Nightmare on DIMM street | StorageMojo
– DRAM error rates: Nightmare on DIMM street | ZDNet
-
“Some of the numbers are really terrifying: 4.15% unrecoverable errors for of the platforms are much more then i had thought and I’m somewhat conservative in my thinking how far i trust hardware. Furthermore hard errors (as in “bit permanently flipped and put it to the trashbin”) are vastly more common reasons for errors as most people think.”
-
Amazon S3 Availability Event: July 20, 2008
For example, in 2008, Amazon S3 was brought down for several hours when a single-bit hardware error propagated through the system.
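How a single flipped bit can be caught before it spreads can be sketched with a plain checksum. The Python snippet below is a generic illustration and not a reconstruction of the S3 incident: the message content and the helper names pack, unpack and flip_bit are made up, and CRC-32 from the standard zlib module is simply one convenient choice of checksum.

```python
import zlib


def pack(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 of the payload (hypothetical message format)."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack(message: bytes) -> bytes:
    """Return the payload, or raise ValueError if the checksum does not match."""
    payload, stored = message[:-4], int.from_bytes(message[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch, message discarded")
    return payload


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Simulate a single-bit hardware error at the given bit position."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


message = pack(b"node-17 is healthy")
print(unpack(message))                # intact message passes the check

corrupted = flip_bit(message, 42)     # one bit flipped in transit or in RAM
try:
    unpack(corrupted)
except ValueError as err:
    print("rejected:", err)
```

Unlike ECC, a checksum like this only detects the corruption; the receiver still has to drop the message or ask for it again.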
-
Real Programmers
http://xkcd.com/378/
-
For a brief period, the Windows kernel tried to deal with gamma rays corrupting the processor cache
– For a brief period, the kernel tried to deal with gamma rays corrupting the processor cache – The Old New Thing
– via: For a brief period, the Windows kernel tried to deal with gamma rays | Hacker News
General information
- Serious Computer Glitches Can Be Caused By Cosmic Rays - Slashdot
- The Invisible Neutron Threat | National Security Science Magazine | Los Alamos National Laboratory
- Computer crashes may be due to forces beyond our solar system | Computerworld
- statistics - Cosmic Rays: what is the probability they will affect a program? - Stack Overflow
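As a rough, back-of-the-envelope take on the probability question, the figure quoted in the DRAM section (a 96% chance of at least one bit flip within three days for 4 GiB of non-ECC RAM) can be inverted into an implied flip rate. The sketch below assumes that flips are independent and arrive as a Poisson process, which is our simplification rather than something taken from the sources; the function name p_at_least_one is made up for illustration.

```python
import math

# Figures quoted earlier: 4 GiB of non-ECC RAM, 96% chance of at least one
# bit flip within three days.
P_AT_LEAST_ONE = 0.96
DAYS = 3
BITS = 4 * 2**30 * 8  # 4 GiB expressed in bits

# Assumption: Poisson arrivals, so P(at least one flip) = 1 - exp(-expected).
expected_flips = -math.log(1.0 - P_AT_LEAST_ONE)
flips_per_day = expected_flips / DAYS
flips_per_bit_per_day = flips_per_day / BITS


def p_at_least_one(gib, days):
    """Chance of one or more flips for `gib` GiB of RAM over `days` days."""
    bits = gib * 2**30 * 8
    return 1.0 - math.exp(-flips_per_bit_per_day * bits * days)


print(f"expected flips in {DAYS} days: {expected_flips:.2f}")
print(f"expected flips per day:      {flips_per_day:.2f}")
print(f"P(>=1 flip), 4 GiB, 1 day:   {p_at_least_one(4, 1):.2f}")
```

Under these assumptions the quoted figure corresponds to about 3.2 expected flips in three days, or roughly one per day for 4 GiB, which is in the same ballpark as the "one error per 4GB of RAM per day" estimate quoted above.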
DRAM off-topic
ECC memory does not protect against everything.
Exploiting Correcting Codes: On the Effectiveness of ECC Memory Against Rowhammer Attacks
SRAM research
- Cosmic-ray multi-error immunity for SRAM, based on analysis of the parasitic bipolar effect | IEEE Conference Publication | IEEE Xplore
- SRAM immunity to cosmic-ray-induced multierrors based on analysis of an induced parasitic bipolar effect | IEEE Journals & Magazine | IEEE Xplore
- https://pdfs.semanticscholar.org/7f15/61e0c7c25fa1021b51a419b692a3728bb80e.pdf
- Review of Accelerated Testing of SRAMs
P.S.: In its infancy, this article was called Re: Memory errors / ECC memory vs. cosmic rays.