The Habitat of Hardware Bugs (embeddedrelated.com)
67 points by ChickeNES on July 13, 2016 | hide | past | favorite | 18 comments


It's bug-free if and only if they can't sell it with bugs

This is so very true. A long time ago I was writing a device driver for a chip. I kept running into a problem and spent days looking for the bug in my code. After all, it had to be my code. No way the chip would fail to work in this mode: thousands of customers would be screaming bloody murder.

Finally, I gave up and called my rep at TI. And found out... they knew about the bug and were in the process of fixing it. Why weren't all those customers complaining of a bug in the chip's most basic mode? Well "actually you guys are only the second company to buy this version of the chip..."


Oh man, chip errata documents are incredibly scary things. They make you wonder how the thing works at all.

Here's the current errata for the Freescale iMX6D/Q. All 225 pages of it.

http://cache.freescale.com/files/32bit/doc/errata/IMX6DQCE.p...


The errata for x86 CPUs are even scarier... especially when you consider the fact that some of them aren't publicly available.


200 pages? That's average for Microchip


Yeah I guess this is actually pretty good for an SoC the size of the iMX.


That is a pretty reasonable way of looking at it. One of the things that made NetApp interesting when I was there was ONTap, a completely custom OS with one memory space and no user mode. When you thought about it, it made sense: all you need for a NAS box is a really feature-rich ethernet driver :-). Anyway, what it meant was that NetApp would uncover problems in CPUs and chipsets that nobody in the "PC" world would ever see: race conditions on the frontside bus, PCI Express traffic that would freeze up the chipset, etc.

It was also true of drive firmware. Drives have all these commands which look good in the manual, except that no PC ever calls them in production. As a result they don't get a lot of testing. We discovered that 'write zeros', a command for zeroing out a disk, on some firmware revs was "write mostly zeros, except when you don't." Never good when you're trying to initialize RAID stripes. As a result there was always a "NetApp version" of the drive firmware which had been qualified, but customers always believed it was just a way of preventing them from using commodity drives[1].

Any time you step off the beaten path and try to use a complex technology in an "unusual" way, you are blazing a trail which may not have been traveled before. Always good to be on the lookout for undocumented bugs.

[1] It did have that effect but it wasn't the motivation.


This works at higher levels of abstraction, too. For instance, NetApp filers have a deduplication feature, where identical files are detected and stored once instead of several times. When one of the files is changed, supposedly a copy-on-write happens. Yet in practice I saw, more than once, two identical files with completely identical time stamps, owned by two different users, where only one file was modified intentionally by a program run by one of the users. (That program logged its actions to a file, and the other user's log would be empty; plus there was no chance that both modified their files at the exact same time.) I concluded that NetApp's deduplication wasn't on the beaten path, or perhaps something in the timing or other specifics of our creation and modification of identical files was unusual.


The first problem with your example is that NetApp deduplication occurs at the block level, not the file level. The second problem is that, given the number of systems in the field utilizing it, if your example were accurate there would be literally THOUSANDS of people up in arms.

Furthermore, their deduplication is post-process, so even if dedup were to somehow modify atime, which it doesn't, you wouldn't have seen the access time change until up to 24 hours after the file was modified.

Troll on.


Then it wasn't deduplication. I swear it was a NetApp file server, two files, one modified by a program that logged the change, the other getting the same bits as the first, the time stamps were completely identical. Dedup was just a guess.


"Completely custom" is a bit of a stretch. It was a very heavily modified FreeBSD. And with 8.x 7-mode, it was straight FreeBSD with "ONTAP" essentially loaded as a bunch of userland processes.


Actually, no. Shared networking stack code, sure; influenced, possibly, as most of the engineers were ex-Sun or ex-SGI, but that was about as far as it went. That changed when Spinnaker was acquired and NetApp ported their Linux stuff to FreeBSD. All Data ONTap code prior to 8 was pretty unique to NetApp.


That's... not accurate. There's a reason every time FreeBSD made SMP improvements, ONTAP was quick to follow. Spinnaker != 7-mode, and cDOT took SpinNP and very little else from that codebase. I don't know what history of ONTAP you followed but it's not accurate.

There's a reason NetApp has been one of the largest contributors to FreeBSD both in terms of code and monetary support since long before they acquired Spinnaker.


Well my "history" was that I was the Technical Director and later Senior Technical Director in the ONTap OS group from 2001 to 2006, how about you?


At least for digital ASICs, if you can control the temperature and core voltages, one strategy for determining whether an issue is in hardware is to see if the problem changes (either stops or happens more frequently) at high temperature/low core voltage compared to low temperature/high core voltage. If the failure rate tracks the corner, that's usually a pretty good sign it's not a software/firmware problem. If you can't push the voltage to the datasheet operating limits, you can try just temperature, but in my experience it's better to do both.
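The corner-sweep triage above can be sketched as a tiny helper. This is just an illustration: the corner values, failure rates, and the 10% threshold below are made up, and in a real setup the results dict would be filled by whatever controls your bench supply and thermal chamber.

```python
# Sketch: decide whether a bug looks voltage/temperature-sensitive
# (i.e. likely hardware) by comparing failure rates across corners.
# All numbers here are hypothetical examples, not real measurements.

def failure_rate_shifts(results, threshold=0.10):
    """results maps (core_voltage, temp_C) corners to observed failure rates.

    Returns True if the spread across corners exceeds the threshold,
    suggesting the failure tracks operating conditions (hardware),
    rather than staying flat (more likely software/firmware).
    """
    rates = list(results.values())
    return max(rates) - min(rates) > threshold

# Example sweep: failures pile up at low core voltage / high temperature.
observed = {
    (0.90, 85): 0.42,   # low voltage, hot  -> worst corner
    (0.90, 25): 0.15,
    (1.10, 85): 0.03,
    (1.10, 25): 0.00,   # high voltage, cool -> best corner
}
print(failure_rate_shifts(observed))  # True: 0.42 spread points at hardware
```

A flat failure rate across all corners would return False, which is the hint to go back to hunting in software.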


A good example of a pretty serious DRAM bug that showed up on PCs a short while ago, yet with surprisingly little coverage in the media:

https://www.ece.cmu.edu/~safari/pubs/kim-isca14.pdf


What makes that different from rowhammer?


I spent weeks trying to stamp out a bug that turned out to be a signal integrity issue between the processor and DRAM. It was horrible. The bug would only happen after about 30 minutes and looked like memory corruption. I spent tons of time looking for an interrupt corrupting memory.


People who lie about having checked signal integrity suck. I have a horror story of my own along these lines, with very creative memory corruption.



