What Flips Your Bit: Cosmic Ray Errors at Mozilla (blog.mozilla.org)
233 points by dannyobrien on April 13, 2022 | 90 comments


"Bitsquatting is a form of cybersquatting which relies on bit-flip errors that occur during the process of making a DNS request. These bit-flips may occur due to factors such as faulty hardware or cosmic rays. When such an error occurs, the user requesting the domain may be directed to a website registered under a domain name similar to a legitimate domain, except with one bit flipped in their respective binary representations.

"A 2011 Black Hat paper detailed an analysis where eight legitimate domains were targeted with thirty one bitsquat domains. Over the course of one day, 3,434 requests were made to bitsquat domains." [1]

Cisco presented a paper on bitsquatting at DEF CON, "Examining the Bitsquatting Attack Surface". From the paper: "The conclusion is that the possibility of bitsquat attacks is more widespread than originally thought, but several techniques exist for mitigating the effects of these new attacks." [2]

[1] https://en.wikipedia.org/wiki/Bitsquatting

[2] https://media.defcon.org/DEF%20CON%2021/DEF%20CON%2021%20pre...
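
For a feel of what these bitsquat domains look like, here is a minimal Python sketch (my own illustration, not from either paper; it ignores finer DNS label rules such as hyphen placement, and treats flips of the "." separator as fair game):

    import string

    # Characters we accept in a squatted name (lowercase letters, digits, "-").
    VALID = set(string.ascii_lowercase + string.digits + "-")

    def bitsquats(domain):
        """Yield names differing from `domain` by exactly one flipped bit."""
        for i, ch in enumerate(domain):
            for bit in range(8):
                low = chr(ord(ch) ^ (1 << bit)).lower()
                # Skip pure case flips (DNS is case-insensitive) and any flip
                # that lands outside the valid hostname alphabet.
                if low != ch.lower() and low in VALID:
                    yield domain[:i] + low + domain[i + 1:]

    print(sorted(set(bitsquats("example.com"))))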


I had the pleasure of working with the author for a brief time, and I attended his presentation. Great stuff. What I found particularly interesting is some later work that characterized the probability of error based upon the device type, and the ambient temperature (based on IP Geo-location).


> "A 2011 Black Hat paper detailed an analysis where eight legitimate domains were targeted with thirty one bitsquat domains. Over the course of one day, 3,434 requests were made to bitsquat domains." [1]

So I had a look into the paper, because I was curious how they would differentiate typos from bit flips ... turns out they don't. Bummer. There are only some assumptions made here and there, without any further explanation. It could even be the result of bad OCR or whatever weird stuff people do on the internet ...

I am not questioning the result, and I also believe it's most likely bit flips in a lot of (most?) cases, but I would have really loved some more insight into that domain ... pun intended.


How does a cosmic bit flip make it past the Ethernet CRC?


It occurs in your machine before hitting the network.


Raymond Chen said a few years ago that overclocked machines produce a lot of spurious crash reports: https://devblogs.microsoft.com/oldnewthing/20050412-47/?p=35...

If those computers are scribbling over memory at random, then that's occasionally going to cause some crashes, but much more commonly result in silent data corruption.


because your machine vendor was too cheap to use ECC DRAM.


Thank you Mr. Torvalds


Would you like to hear my opinion of NVidia?

Please be sure to remove any young children from the room beforehand.


Amazing comment, this is wild. Thank you for sharing this!


Is one mitigation TLS certificate verification?


It depends on whether or not the client specifically requested, or is expecting, traffic over HTTPS. But yes, if the user's client requests encrypted traffic, the attacker will not be able to produce a valid certificate. This attack isn't that different than a MITM.


Nothing prevents an attacker from getting a cert for snytimg.com or oslashdot.org.


That only helps the attacker if the error happened before reaching the DNS-specific path. If the error happens inside the DNS path, then the browser is still expecting to get a certificate for the correct website.


If the error happens inside the DNS path, the name in the answer won't match the name in the query. The browser is still expecting to get an answer for the hostname it sent.


Depends on where in the stack the error happened.


[flagged]


> Why not read the linked paper,

You answered your own question. The answer is on page 12, which means there is too much information. He is not interested in the whole topic, just in this one question. So he asks; maybe someone is charitable enough to answer. Nothing wrong with that.


Skim until you are in the right section?

Literally titled: "Section II - Mitigation of bitsquatting attacks"


How could I possibly answer their question better than the experts who wrote the paper?


Maybe you can't but someone else can. The question is open to anyone who can and wants to answer.


But the answer is right there, section 2, starting on page 12 of the linked paper.

We're hackers. Autodidacticism is our creed. Research papers are our bread and butter. Why settle for speculation from the pseudonymous masses when a rich vein of well written analysis by reputable authors is one click away?


I worked at Sonus Networks (now Ribbon[0]) in the early 2000's building VoIP solutions for telcos. We had a bunch of unexplained errors in a new installation in Denver. After much head scratching the engineers on the problem concluded that the higher altitude significantly increased the likelihood of impact by alpha particles and that that was the cause of the problem!

(IIRC we increased the shielding on the devices.)

https://ribboncommunications.com/


> higher altitude significantly increased the likelihood...

This idea looks promising, but Denver is not high enough to make much of a difference from sea level.

In my experience, a radiation detector shows a significant rise from ~6,000 m, and at a standard civilian airliner cruising altitude of 10,000 m it buzzes continuously :)

Most probably the errors were caused by electrical surges, a local radiation anomaly, or something like radar equipment.

I've heard that in a few places in the world there are old buildings, even churches, built from stone with a high content of natural radionuclides. The radiation there is significantly higher than normal background, but not high enough to kill anyone quickly, so nobody cares; the IT departments there, though, keep finding unexplained errors ;)

Btw, stone with high radionuclide content is widely distributed. After Chernobyl we started paying attention to this (Sweden was the first to care, back in the 1940s; other countries followed after the 1950s). And one person even told me that near our city there are two quarries: stone from one is used for industrial buildings and roads, stone from the other for homes.


Minor clarification: alpha radiation doesn't come from cosmic rays, rather from U/Th contamination in the circuit materials. The altitude dependent component you'd be seeing is rather muons and neutrons.


you can get alphas from spallation, but you wouldn't really call that alpha radiation.


Maybe it's a stupid question, but did the additional shielding completely resolve the discrepancy?


As the article points out, using collected client data is problematic, because some errors will often be undetectable, as in numeric data. And in general, you would have to control for bit flips somehow caused by software.

I wonder whether a SETI approach would be useful here. Allocate, say, 1MB of memory. Fill it with some known bit pattern. Periodically check the memory and look for discrepancies. Do this once an hour, on 10M devices, and that is a LOT of monitoring. Report discrepancies along with time, location (including elevation), hardware and OS information.

I would think that this approach would provide a lot of interesting information about when and where bit flips occur, especially when matched against information on solar and atmospheric events (as in the article). Perhaps sensitive hardware and OS environments would be detected. Even completely negative results would be interesting: no bit flips observed would suggest that purported bit flips elsewhere might have other explanations.
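
A minimal sketch of such a monitor (pure Python, so it cannot pin the buffer to physical RAM -- the OS may page it out, in which case you would be testing the disk rather than DRAM -- but it shows the shape of the idea):

    import hashlib
    import time

    SIZE = 1 << 20              # 1 MiB test buffer
    PATTERN = b"\xaa" * SIZE    # known bit pattern (10101010...)

    buf = bytearray(PATTERN)
    expected = hashlib.sha256(PATTERN).digest()

    while True:
        time.sleep(3600)        # check once an hour
        if hashlib.sha256(buf).digest() != expected:
            flips = [i for i in range(SIZE) if buf[i] != PATTERN[i]]
            print(f"{time.ctime()}: mismatch at byte offsets {flips[:10]}")
            buf[:] = PATTERN    # reset the pattern and keep monitoring

A real client would report those offsets along with time, location, elevation, and hardware/OS information, as described above.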


A few years ago I heard that the US government created an app that periodically checked for random light flashes on a smartphone's camera sensor and sent the info to the cloud, the idea being to use phones as a large distributed network for detecting illegal nukes. (Btw, Soviets and Russians really do love collecting weird facts like this.)

The idea looks very realistic, except that it would drain the battery and could also be used as a surveillance tool, so I have never seen a real app.

Strictly speaking, any digital camera sensor is an excellent tool for such things, much better than RAM; you only need to cover the lens with something opaque to light but transparent to particles (any thin plastic will do), then run a monitoring program and store the logs in the cloud.

And in real life you can buy a Geiger counter shield for an Arduino on AliExpress; one pal of mine even exposed such a sensor to the internet.

Better still, use two such sensors and a simple logic circuit, so they register not every event but only those both sensors detect simultaneously; that gives you the vector the cosmic particle came from.

This method of many sensors with coincidence logic and logging is what was used in the research that detected hidden rooms in the Great Pyramid of Giza (real cosmic rays are so powerful they can be detected with simple equipment through up to a hundred meters of rock, so ordinary concrete is like cardboard to them). I even saw a post from a guy who installed such machinery at home (8 or 16 counters, I forget the details), and within a few months the logs clearly showed where the window in his room was :)
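
A toy sketch of that coincidence filtering, assuming each sensor produces a sorted list of event timestamps in seconds (the function name and window are illustrative):

    def coincidences(a, b, window=1e-6):
        """Return timestamps from sensor `a` that have a partner event in
        sensor `b` within `window` seconds (two-pointer sweep over two
        sorted streams)."""
        out, j = [], 0
        for t in a:
            # Skip b-events that are too early to pair with t.
            while j < len(b) and b[j] < t - window:
                j += 1
            if j < len(b) and abs(b[j] - t) <= window:
                out.append(t)
        return out

    # e.g. coincidences([0.10, 0.25, 0.40], [0.1000003, 0.50]) -> [0.10]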


> "I've few years ago hear, US gov't created app..."

At least at one point they had satellites deployed for this purpose. The phenomenon they used to determine whether a detonation had taken place was the characteristic 'double flash'. You can read some more about this here: https://en.wikipedia.org/wiki/Vela_incident

They don't need to resort to such clandestine measures as infiltrating people's smartphones for this purpose. A nuclear blast was a pretty difficult thing to hide even 40 years ago.


> satellites deployed for this purpose

You're missing the point. I know about the worldwide system for monitoring nuclear explosions, but I'm not talking about that.

I mean that the threat of terrorists using a nuke is, in its consequences, very close to an actual blast.

Imagine how much panic would break out if a plausible claim appeared that terrorists had a nuke and were planning to use it in one of the biggest US cities.

So special semi-military agencies work constantly on eliminating the possibility of terrorists getting access even to radioactive waste, never mind enriched materials.

And this is what differentiates the Western system of values from the Russian or Asian ones - in the Western system the loss of even one person is considered unacceptable, whereas the Russians accept sacrificing 20k soldiers for a crazy escapade, which is what we now see happening in Ukraine.


This could be crowdsourced. Imagine some sort of reporting network of volunteers, running the simple program (memory allocation and periodic checks) on any hardware they're loaning time on, and submitting their location and altitude as well.


Any sort of hardware or software error seems much more likely. Computers are incredibly complex and approximations are used everywhere (in the design of the hardware, in the theory of operation). I don't think inference-based experiments or analysis on cosmic ray bit flips are appropriate.

You really need some kind of dedicated cosmic ray detector nearby as a control. If the flux of cosmic rays into the detector is orders of magnitude lower than the rate of bit errors you ascribe to cosmic rays, it's probably some hardware/software issue and not the cosmic rays.


Indeed, there was a study in IEEE pointing out the absurdity of cosmic rays as causes -- one point cited was that the vast majority of bit flips happen at specific points in the address space, essentially page boundaries between chips.


I'm curious why that is evidence against the cosmic ray explanation.

Couldn't it have something to do with the physical layout of memory? Perhaps those page-boundary-adjacent addresses present a larger physical target, perhaps on the bus.

Of course I am wildly speculating right now. I'd love to see the article if you have a link!


Apologies everyone, the paper I was thinking of was ACM: "Cosmic Rays Don't Strike Twice" -- https://dl.acm.org/doi/pdf/10.1145/2248487.2150989

The point being that consumer grade memory is straight up error prone, and a bitflip isn't necessarily caused by "cosmic rays" but could just be like, a flaky DRAM chip, network card, etc.

If you are Mozilla, it doesn't really matter what the source of the bit flips is so much as a good understanding of their prevalence and how they might impact your customer experience, telemetry, and even security.


Modern devices have tiny features which are extremely fragile to any sort of interference, which are much more abundant than cosmic rays.

See the row-hammer attack where you can flip an unrelated bit just by read/writes to adjacent bits from software!!!


This is also why memtest has all these testing patterns, and one reason why you typically want to leave it running overnight (or as long as feasible) instead of "yup, we read and wrote all bits and it's all fine!"

Row hammer is one of those things that took advantage of something everyone already kind of knew, just never thought about as a security problem.


This is tangential, but "Cosmic rays do not cause bit flips at a significant rate" is the default position and does not need supporting evidence.

It's like if I claimed that planes occasionally crash due to a local density of dark matter pulling them down. Nobody would need to provide evidence against my theory, it's outrageous. I'm the one who has to provide the evidence.


> This is tangential, but "Cosmic rays do not cause bit flips at a significant rate" is the default position

No it's not. It's very well accepted in the silicon industry that cosmic ray events (the secondary particles) are responsible for enough bit flips that they have to be concerned with designing robust circuits and logic to achieve desired failure rates.

https://www.microsemi.com/document-portal/doc_view/130760-ne...

    14. Are radiation effects at ground-level just a theoretical problem?
    No, based on FIT rate data from Xilinx UG116, the largest Virtex ®-6 device (XC6VLX760) with 184,823,072 configuration bits will have a nominal failures-in-time (FIT) rate of 176 at sea-level in New York. While this represents a mean time between failures (MTBF) of 648 years, a system comprised of 1,000 FPGAs would experience a failure every year. The same systems based in Denver would experience failures every few months.

    15. Are there any widely reported incidents of errors due to charged particles?
    Several incidents across many industries have been reported in recent years. Among these:
    • In 2008, a Quantas Airbus A330-303 pitched downward twice in rapid succession, diving first 650 feet and then 400 feet, seriously injuring a flight attendant and 11 passengers. The cause has been traced to errors in an on-board computer suspected to have been induced by cosmic rays. Modifications were undertaken to mitigate such errors in the future.
    • Canadian-based St. Jude Medical issued an advisory to doctors in 2005, warning that SEUs to the memory of its implantable cardiac defibrillators could cause excessive drain on the unit's battery.
    • Cisco Systems issued a field notice in 2003 regarding its 1200 series router line cards. The noticed warned of line card resets resulting from SEUs.
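
(To unpack the arithmetic in item 14 -- a FIT is one failure per 10^9 device-hours -- here's a quick sanity check in Python that just reproduces the quote's math:)

    HOURS_PER_YEAR = 24 * 365

    fit = 176                      # quoted FIT rate: failures per 1e9 device-hours
    mtbf_hours = 1e9 / fit         # MTBF of a single device
    print(mtbf_hours / HOURS_PER_YEAR)          # ~648 years, as quoted

    fleet = 1000                   # a system of 1,000 such FPGAs
    print(mtbf_hours / fleet / HOURS_PER_YEAR)  # ~0.65 years: roughly one
                                                # failure per year
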
https://www.intel.com/content/www/us/en/support/programmable...

     Unavoidable atmospheric neutrons remain the primary cause for SEU effects today.
https://www.asminternational.org/documents/10192/26583572/ed...

     One of the most important reliability concerns for silicon circuits is soft errors in SRAM circuits, which involves electrical upsets generated by the interaction of energetic atomic and subatomic particles with the silicon substrate material. SRAMs are particularly sensitive to radiation-induced soft errors due to the relatively low amount of charge at the storage nodes. Errors are generated by the impact of alpha particles emitted from trace amounts of uranium in solder and packaging materials of the circuit, and by neutrons that originate in the cosmic ray shower in the Earth’s atmosphere.
https://www.reliablemicrosystems.com/wp-content/uploads/2021...

    Even in the absence of on-chip sources of radiation, recent studies have conclusively proved that terrestrial cosmic rays (primarily neutrons) are a significant source of soft errors in both DRAMs and SRAMs [169-171]. Upsets have been observed both at ground level and in aircraft and have been convincingly correlated to the altitude and latitude variation of the neutron flux [172,169,171]. Lage, et al. have shown that even without alpha particles, a baseline of cosmic-ray upsets still exists for high-density SRAMs [170]. O’Gorman has shown that neutron upsets disappear for DRAMs placed 200 meters underground in a salt mine, while they increase dramatically for systems operated above 10,000 feet in Leadville, CO [169].

> and does not need supporting evidence.

Even if we did not already have a large body of evidence to show that it was a concern, that would be untrue.

"Cosmic rays do not cause bit flips at a significant rate" is no less a claim than "cosmic rays do cause bit flips at a significant rate", and would require no less evidence. It does not somehow become the "default" just because it predicts little or no interaction.


Here’s contradicting evidence to your position:

https://static.googleusercontent.com/media/research.google.c...

The point OP makes is that the more complicated a claim, the more evidence is required. More common sources of errors would seem to be more likely, and thus more common causes of bit flips.

Thus more evidence is required for the cosmic ray hypothesis being a dominant reason than anything else. We know that empirically there’s ~1 bug in every 1k lines of code. 1 in 10k if you have very good tests. But flip type errors are probably less common so let’s guess and say 1 in 10 million. There’s about ~30 million lines of code in the Linux kernel. There’s probably a similar amount of userspace code (eg Firefox is also around 20 million lines). Then think about the Verilog that backs HW designs. I don’t know the size of those codebases to have estimates but it feels like bit flip bugs are possible there. Then you’ve got to actually synthesize that digital logic and implement it in analog space. Components could easily be driven out of spec electrically (whether by accident, manufacturing defect, or swapping in lower cost components) and bit flips would be comparatively a common type of error when shuttling them around, especially sensitive across high bandwidth links that aren’t error-checked.

The point is, the combined probability of all these sources of error seems higher than that of true cosmic rays being behind bit flips. The Google paper is just more evidence of this. I'm sure that measuring just for cosmic rays you'll be able to see their impact. But in a production system running at scale, on variable-quality hardware and arbitrary software versions, all the other sources of error would seem like more likely first-order effects that would swamp any ability to detect cosmic rays. Not to say that Mozilla hasn't accounted for it. Just that OP's position is the default sensible position to start from (i.e. Occam's razor).


That's not contradicting evidence. Defects are certainly common sources of error, particularly with cheap commodity components like those used in Google's fleet. That does not prove cosmic rays aren't a significant source of SEUs [in any computing device].


No but it adds significant credence against the cosmic ray hypothesis. Particularly since this Mozilla blog post has lots of speculation without any actual data to support the claim.

FF is running on even lower-cost commodity components without strict controls on the operating environment (e.g. environmental heat isn't controlled like it is in data centers, power supplies can be borderline vs. Google's carefully specced machines, etc).


It does not add any "credence against" the well established fact that cosmic rays flip bits at significant rates that it has to be designed for.


I'd be very interested in reading that article if you have a link (or title, or doi...)


I believe people use "cosmic rays" as catch-all phrase for all these very low probability error causes (just because of the coolness of cosmic rays), but in practice _any_ other cause is much more common than cosmic rays.

Even at the processor level every single transistor on it has a rated mean time between failures a.k.a. MTBF. Sure it may be astronomical, but you do have a lot of transistors, so in practice a random bitflip is not such a rare event. Designers actually explore MTBF vs power usage trade-offs here, and there is even a fascinating area of "fault resilient computing" research.

Every single clock domain crossing has another MTBF (google metastability). Again they are very high (billions of years if done properly), but you will have plenty of such crossings (and the number keeps growing with modern, more asynchronous design).

Processors are quite unreliable things.


Ironically, even though the more modern, "asynchronous" (really, more just asynchronous communication between fully-synchronous clock domains) CPU designs result in more chances for metastability, a fully asynchronous, self-timed design wouldn't have to have any likelihood of metastability at all!


Yes, but what you'd want to do is look for coincidences between a detector for a cosmic ray shower around (above?) the electronics you're monitoring with whatever it is these days that instruments ECC events. The time resolution would be pathetic for a nuclear physics experiment, but probably good enough.

If you look at the ambient gamma-ray spectrum in a semiconductor detector (which would be germanium rather than silicon) the main background you see is typically from concrete; I'm ashamed to say I've forgotten the energy from K-40, but in the region of 1500 keV. (Ironically, large concrete blocks used for shielding would be regarded as a significant radiation hazard if all the activity in them was concentrated.)


I don't know, folks.

2 years ago I took a laptop which I wasn't using (16 GiB RAM, non-ECC) => I created in Linux with Python an array ("bytes"? Don't remember exactly anymore) of ~10 or 12 GiB containing random integers => computed the array's hash and saved it.

Then for ~1-2 months I recomputed from time to time the hash of that array (in between, the laptop was in suspend-to-RAM) and compared it to the original result => it always matched; I never had any bit flips.

I therefore doubt that the estimate of "1/256MB/month" is correct - at least I could not confirm it with my laptop.
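
For scale, a quick expected-value check (taking the article's figure at face value and assuming ~12 GiB observed for ~2 months, per the numbers above):

    rate_per_mb_month = 1 / 256   # the article's "1/256MB/month" figure
    mb_observed = 12 * 1024       # ~12 GiB under observation
    months = 2

    print(rate_per_mb_month * mb_observed * months)  # ~96 expected flips;
                                                     # observing zero argues
                                                     # against that rate here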


A server with 64 GB of ECC ram sitting at an altitude of 3.2 km on the Greenland ice sheet is reporting... 0 bit errors (whether correctable or uncorrectable) in the 244 days it's been up.

A server with 16 GB of ECC ram at an altitude of 3.8 km in California is reporting.... 0 bit errors in the 146 days it's been up.

Maybe I shouldn't believe what /sys/devices/system/edac/mc is reporting? These are EL8 systems...


Bryan Cantrill mentions in this talk that they saw no correctable errors, until they suddenly got uncorrectable errors. Turns out that the firmware was hiding the fact that it had been correcting ECC errors the whole time. https://youtu.be/vvZA9n3e5pc?t=1193


I've always been a bit skeptical of published numbers. I usually just chalk it up to vastly different operating conditions and scale.

On my home server w/ ECC you can check the corrected and uncorrected (multibit) errors. Assuming my Ryzen is correctly reporting them to Linux, I have 0 errors corrected and 0 uncorrected with a 80 day uptime. I've checked a few other times and never seen an error. Others with ECC often report the same.

My understanding of modern RAM is that it has checks built in to the modules which are somewhat equivalent to ECC already (the correcting part, not the reporting part). Which is a necessity in order to hit the density we are at today.


DDR5 does, and so does the Raspberry Pi's memory.

But DDR4 and earlier do not, I think.


>I therefore doubt that the estimation of "1/256MB/month" is correct

As someone who did incredibly poorly in high school physics, this line in the article bothered me as well: the study is from the 1990s when the density of memory would have been much lower. I would think the percentage per megabyte has dropped significantly in 30 or so years. It also assumes a constant form factor for the memory, doesn't it?


The energy needed to flip a bit has drastically lowered since the 1990s, and increased density makes bit flip rates increase at some point since a single physical event can flip multiple bits.

So it's not so simple.


Thanks. So I was sort of right only in the exact opposite direction!


> I therefore doubt that the estimation of "1/256MB/month" is correct

The probability is related to the physical volume the memory takes, since it's caused by a physical particle going through that volume. So, this rate will continuously drop as memory density increases.


My understanding is that newer/denser transistors are actually more susceptible to single event upsets, because it takes less energy to flip a bit.


The last time I did research into this, this was the most concrete real world example I could find, which works out to ~41 bitflips/GiB-month:

> Jaguar had 360 terabytes of main memory [and] was logging ECC errors at a rate of 350 per minute

Source: https://spectrum.ieee.org/how-to-kill-a-supercomputer-dirty-...
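
The ~41/GiB-month figure falls straight out of those two numbers (treating the 360 terabytes as TiB):

    errors_per_min = 350
    memory_gib = 360 * 1024                           # 360 TiB -> 368,640 GiB

    errors_per_month = errors_per_min * 60 * 24 * 30  # ~15.1M ECC errors/month
    print(errors_per_month / memory_gib)              # ~41 flips per GiB-month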



This is throwing up a soft paywall for me FWIW - I get up to "...couldn’t run more than an hour or so without crashing." and then the article text stops, there's a "become a member" interstitial and then I'm fed with tons of headlines for different articles instead.

Because of the sharp contrast with the interstitial block I thought it was like a footer, and then a skippable/ignorable part of the layout, and spent some time confusedly scrolling up and down thinking the content was past it, then I realized the "Keep Reading" bits were links to other articles... wow what a mess.

</rant>


In 2014 I worked on a video game that loaded 1,000 instances of a mutable game world into memory. Each machine had 256G of RAM (about 210G was allocated to the game server) and I ran about 2,500 of those.

Large mutable state like this tends to stress memory bandwidth, so I'm not sure it's related, but we used ECC memory and saw a seriously insane number of corrected bit flips.

Somewhere on the order of 100/s over the whole fleet as a baseline.

There were times when this rate spiked due to a faulty RAM stick, but if you look at the distribution it was fairly even, though there were fairly random hotspots.

After observing this I decided I didn't want to live without ECC in my personal life; to my surprise there is no laptop on the market that supports ECC. Or at least none that I could find.

This issue goes away with DDR5 and its built-in ECC, though. Or at least our visibility of it does. So, I'm happy.


Agreed. I once studied the subject of cosmic rays, and found estimates that every square meter suffers HUNDREDS of events per second, though mostly weak ones.

What I actually see: in dark-current test videos from digital cameras you can usually spot a few flashes.

So it looks like modern RAM has some sort of ECC inside the chip.

And yes, digital photo sensors are much better suited to detecting cosmic rays than RAM. Even the simplest ones work, and (semi)professional ones that can be tuned for long exposures are better still. Even on a laptop webcam or a phone's front camera you can see events in low light (or if you cover the lens with something opaque).


I bit squatted cloudfront.net years ago and got many, many requests. Most of them *.js which would, if I were malicious, have allowed me to do just about anything. It was interesting to see that the errors definitely happened in different places. For instance, sometimes the Host header was the original domain and sometimes it matched my domain.


This is fascinating and hints at a future possible scientific study: using phones across the globe to map cosmic ray events. I'm not a physicist so I can't speak for the value of such data. If cosmic ray events do not occur uniformly across the globe then mapping events from 100,000s of phones could give interesting insights.


As they say, there's an app for that:

https://cosmicrayapp.com/

Basically it monitors your camera for the streaks produced by cosmic rays. You can see the real time stream of events here:

https://cosmicrayobserver.com/#0.4/0/0


Looks like the rays only hit populated areas. https://xkcd.com/1138/


A fascinating revelation indeed.


Wait, this works? That's amazing. I suppose you could do the same with any camera sensor?


This is amazing - thank you for sharing!


> In almost every case we cannot find any plausible explanation or bug

Observe the natural state of every software developer. I kid... or do I?

> What if it wasn’t just some fantastical explanation?

Doesn't sound nearly as fantastical, but bad RAM is probably more common than one would expect. You seldom really know the quality of the hardware you run on. Just say'n, sometimes you don't need a helping cosmic ray.


On the subject of bit flips, I am able to detect these in the client-to-server UDP packets in my game. With specific logging enabled I would see an error about once per minute while receiving about 15,000 of one type of packet per second. I was able to estimate that about 1/1,000,000 packets contained a single flipped bit.
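
That estimate is roughly consistent with the quoted numbers:

    packets_per_sec = 15_000
    errors_per_min = 1

    print(errors_per_min / (packets_per_sec * 60))  # ~1.1e-6, i.e. about one
                                                    # flipped packet per 900k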


I mean, you are using UDP. The packets may have been damaged in transit just due to crummy wiring, and not due to cosmic rays corrupting memory.


UDP still has checksums. The stack drops invalid packets.


You are right, but either the checksum needs to be inspected at the application layer, or the 1/10^7 was undetected bit flips. At the time I read about general bit-flip probability over the internet, and it was consistent with what I was seeing. Maybe I can find the paper.


I suspect without great evidence that cosmic ray bitflips are mostly a scapegoat for imperfect hardware and are in fact one or two orders of magnitude less common than popular wisdom would suggest.


I had the opportunity to design my SoC from scratch, mostly ripping off Berkeley's public design.

Something I have documented over the last 2 years: solar flare activity is what causes problems. All memory is ECC, but it still happens.

Faraday cage incoming?

Wait - Faraday-caged racks: million-$ idea?


Using an FD-SOI process can help reduce soft errors.


One of the first things you'll learn when studying experimental physics is how to come up with all kinds of alternative mechanisms that might explain the result you've observed in your experiment, and then think of ways to test that the results weren't actually caused by those unwanted mechanisms. Most Nobel-prize winning physics experiments were carefully designed to compensate for any relevant secondary effects, and I would even go as far as saying that this is often the largest challenge when doing high-precision experiments.

So the first question I'd ask myself when thinking about cosmic-ray induced errors is how I would ensure that the bit flips are not caused by e.g. problems on the hard drives or the NAND array (which are probably much more likely to occur than cosmic ray events, at least on the surface of the earth).


Yeah that was one of the gotchas in the story at https://blogs.oracle.com/linux/post/attack-of-the-cosmic-ray... -- MAYBE it was a bit flip due to a cosmic ray, or MAYBE it was a bit flip due to another layer of the system that makes RAM chips store and retrieve data.

I like the idea of a physicist who thinks about this and says - "well, why should we shrug and say 'maybe it was a cosmic ray?' Surely we can test this! Let's put the computer in a lead-lined enclosure and benchmark the memory failure rate and see if it changes", or whatever.

That's a great extension of the classic computer-hacker view that "of course we can understand why this bug happened, we don't have to shrug and say it segfaults until we restart sometimes, we can just dig some more." How far can you go?


We see a correlation between (major) solar activity and hash/signature verification failures from clients -- on the order of millions of verifications per day, only 30k failures per day, max.

I just finished looking into it in our reporting and was pretty impressed to see spikes lineup with dates here: https://www.spaceweatherlive.com/en/solar-activity/top-50-so...


This is why I always buy ECC/EDAC capable servers. SEUs are a real thing.


What about overclocking? Does it cause bit flips? Especially low-grade DDR4 pushed to its limit…


The '1 error for every 256MB memory a month' sounds like way too much to me.

A program I wrote launches every time I start my computer. It allocates some memory and scans it periodically for unexpected changes. After an equivalent of 15.8 256MB/months no anomalies have been found yet.

Would really like to see more authoritative figures for modern consumer hardware.


It did now finally happen, after an equivalent of 18.9 256MB/months.


Google did a pretty comprehensive study on their data center DRAM a decade ago:

https://static.googleusercontent.com/media/research.google.c...


If I wanted to reproduce bitflipping (from any source) on my laptop (any computer, really) over the shortest time frame possible, how could I conduct that experiment? Any pointers welcome.


Reduce the memory voltage. This is probably easiest on a DIY desktop, because the controls are more accessible, but you might be able to use overclocking tools to adjust voltages on your laptop.


I know HN has a decent Factorio fanbase. Factorio properly stresses PC hardware, and borderline memory is usually OK for a casual gamer until you start a Factorio megabase. A decent example is Warger, who does speedruns: https://forums.factorio.com/viewtopic.php?f=7&t=100646 https://www.speedrun.com/factorio#100 For those that have played the game: the speedruns are amazing to watch, if you haven't already seen them.


The FFF about bugfixing talks about potential cosmic ray (or bad hardware) corruption:

https://factorio.com/blog/post/fff-228


How do they know it's cosmic rays and not something else?



