
This seems like precisely the sort of thing that a competent manufacturer should rule out formally. Formal verification of individual FUs isn't exactly ambitious...

I think we're getting to levels of complexity where the process Intel uses, with lots of different QA and testing teams doing their best to look for bugs, just isn't going to cut it. We need formally verified models transformed step-by-verified-step all the way down to the silicon. It's already feasible, with free tools, to formally verify your high-level model (using e.g. LiquidHaskell) and then transform this to RTL (using e.g. Clash). With Intel's QA/testing budget, it's well within reach to A) verify the transformation steps and B) figure out how to close the performance gap between machine-generated (but maybe slower) and hand-rolled (faster, but evidently wrong) silicon.
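To make the idea concrete, here is a toy sketch (my own illustration, not LiquidHaskell or Clash) of checking a structural hardware model against its arithmetic specification. At 4 bits an exhaustive check is feasible and stands in, weakly, for a real refinement proof:

```python
# Toy sketch: a gate-level ripple-carry adder model checked exhaustively
# against the arithmetic spec (addition modulo 2**WIDTH). All names here
# are invented for illustration.

def full_adder(a: int, b: int, cin: int) -> tuple:
    """Gate-level full adder: returns (sum_bit, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(x: int, y: int, width: int = 4) -> int:
    """Structural model: a chain of full adders, LSB first."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result

# Exhaustive equivalence check of model vs. spec over all input pairs.
WIDTH = 4
ok = all(ripple_add(x, y, WIDTH) == (x + y) % (1 << WIDTH)
         for x in range(1 << WIDTH) for y in range(1 << WIDTH))
print(ok)
```

At real word widths this brute-force check is hopeless, which is exactly why the refinement steps themselves need to be verified rather than the outputs enumerated.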



I've done formal verification on several units of a microprocessor, and I can assure you, formal verification on individual FUs is very, very ambitious (think impossible) for a modern microprocessor.

For example, you cannot possibly formally verify the fetcher unit on its own, because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

Typically, you run formal verification on sub-blocks of sub-blocks of FUs.

For this particular bug, it looks like multiple functional units are involved, so it might have been missed by formal verification.


> because the state space that you need to cover for several cycles for all the module inputs and outputs is beyond the capability of any formal verification tool.

Then use a proof methodology that doesn't require exhaustive enumeration. This objection is actually fairly alarming to me; perhaps there is a larger disconnect between industrial formal techniques for hardware and software than I thought. Strings also occupy a "large state space", but this obviously doesn't prevent us from doing formal verification on functions over strings.
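As a toy illustration of the point (my own example, not from any hardware flow): here is a property quantified over all lists of characters, an infinite "state space", discharged in Lean 4 by structural induction rather than enumeration:

```lean
-- A property over infinitely many inputs, proved without enumerating any:
-- mapping a function over a list preserves its length.
theorem length_map_chars (f : Char → Char) (s : List Char) :
    (s.map f).length = s.length := by
  induction s with
  | nil => rfl                -- base case: the empty list
  | cons c cs ih => simp [ih] -- inductive step: reuse the hypothesis
```

The induction principle does the work that exhaustive enumeration cannot.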


  Then use a proof methodology that doesn't require exhaustive enumeration.
You may as well have said "then you use a magical tool that doesn't exist".

  Strings also occupy a "large state space", but this 
  obviously doesn't prevent us from doing formal 
  verification on functions over strings.
What is your reason to think this is 'obviously' the case?


> You may as well have said "then you use a magical tool that doesn't exist".

Ok, so what this tells me is that you're not aware of what modern verification techniques look like.

There's not really any one resource I can point you to, but take a look at these links. I've used these or similar technologies personally, but there are others I haven't used.

https://en.m.wikipedia.org/wiki/Intuitionistic_type_theory

https://en.m.wikipedia.org/wiki/Homotopy_type_theory

https://leanprover.github.io/about/

http://goto.ucsd.edu/~rjhala/liquid/liquid_types.pdf

> What is your reason to think this is 'obviously' the case?

Because I've done this, and anyone who claims to be familiar with formal methods should be at least passingly familiar with all the things I mentioned.

Even if you aren't, if you've ever heard of SAT (a fairly universal CS concept) you should at least be familiar with the idea that non-exhaustive proofs are a thing.


  Ok, so what this tells me is that you're not aware of what 
  modern verification techniques look like.
Well, techniques are not tools. I am asserting that it is quite probable that no practically useful tool exists to non-exhaustively verify the state space laydn wants to cover.

Notwithstanding your example that it's possible to non-exhaustively verify some functions on strings, there is quite some distance between that and verifying just any function.

  Because I've done this,
Sure, but how does that make it obvious to laydn that his state space is coverable? And what makes it so obvious to you that his state space is coverable? After all, the 'largeness' can have different sources, including those that make non-exhaustive methods infeasible.

I think this may be an example of the disconnect between research and the industry. Researchers say things are solved when they have shown something is possible and published about it. They feel it's then up to the industry to extrapolate, while research moves on to exciting new and greener pastures. Meanwhile the industry thinks the results are too meager, thinks the extrapolation involves a lot of technical difficulties and generally is not willing to spend enough money on what they cannot see as anything but a longshot.

If you do any model checking at all, with tools like SPIN or TLA+, you are already in the state-of-the-art minority in industry.
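For a sense of what explicit-state model checking in the spirit of SPIN or TLC does under the hood, here is a toy sketch (the protocol model and all names are invented for illustration): breadth-first search over every reachable state of a two-process test-and-set mutex, checking an invariant in each state.

```python
# Toy explicit-state model checker: BFS over reachable states of a
# two-process mutex, checking mutual exclusion everywhere.
from collections import deque

def successors(state):
    """Atomic transitions. State is (pc0, pc1, lock_held)."""
    pc0, pc1, lock = state
    pcs, out = [pc0, pc1], []
    for i in range(2):
        pc, new_lock = pcs[i], lock
        if pc == "idle":
            new_pc = "wait"
        elif pc == "wait":
            if lock:
                continue                 # lock held: process i is blocked
            new_pc, new_lock = "crit", True   # atomic test-and-set
        else:                            # "crit"
            new_pc, new_lock = "idle", False  # release
        npcs = list(pcs)
        npcs[i] = new_pc
        out.append((npcs[0], npcs[1], new_lock))
    return out

def check(init, invariant):
    """Exhaustively explore reachable states; report any violation."""
    seen, queue = {init}, deque([init])
    while queue:
        s = queue.popleft()
        if not invariant(s):
            return False, s              # counterexample state
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return True, len(seen)               # invariant holds everywhere

mutex = lambda s: not (s[0] == "crit" and s[1] == "crit")
print(check(("idle", "idle", False), mutex))
```

This tiny model has 8 reachable states; the state-explosion problem the parent comments describe is precisely that realistic hardware models make this set astronomically large.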


I've had great success (in software) with randomized/probability based testing. You don't walk the full state space but generate subsets randomly to walk.

If done naively, it doesn't work because you are unlikely to hit the problematic bug. So you have to be clever: do not generate inputs uniformly! Generate those inputs which are really really nasty all the time. If you find a problem, then gradually simplify the case until you have reduced the complexity to something which can be understood.

With some experience for where former bugs tend to live, I've been able to remove lots of bugs from my software in edge cases via this method. (See "QuickCheck").
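A hand-rolled sketch of that workflow (QuickCheck proper is a Haskell library; everything below, including the deliberately buggy function, is my own illustration): bias generation toward nasty boundary values, then shrink any failing input before reporting it.

```python
# Property-based testing sketch: biased generation plus greedy shrinking.
import random

def nasty_int(rng):
    """Generate boundary values (where bugs cluster) half the time."""
    if rng.random() < 0.5:
        return rng.choice([0, 1, -1, 2**31 - 1, -2**31, 2**32])
    return rng.randrange(-10**6, 10**6)

def shrink(x):
    """Candidate simplifications of a failing input."""
    return [x // 2, x - 1, 0] if x else []

def falsify(prop, tries=1000, seed=0):
    """Return a minimal counterexample to prop, or None if none found."""
    rng = random.Random(seed)
    for _ in range(tries):
        x = nasty_int(rng)
        if not prop(x):
            while True:                      # greedily shrink the failure
                smaller = [c for c in shrink(x) if not prop(c)]
                if not smaller:
                    return x
                x = smaller[0]
    return None

# A buggy "absolute value" over a 32-bit register model: wrong at INT_MIN.
def bad_abs(x):
    return (abs(x) & 0xFFFFFFFF) if x != -2**31 else x

print(falsify(lambda x: bad_abs(x) >= 0))
```

Uniform generation would almost never hit INT_MIN; the biased generator finds it quickly, and shrinking confirms it is already minimal.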


So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?


In general, people expect hardware to be far more bug-free than all the software which depends on it. I'm disgusted enough at the current trend of continuously releasing half-baked software and "updating" it to fix some of the bugs (while introducing new bugs in the process); and yet it seems to be spreading to hardware too. From the perspective of formal verification, buggy hardware is like an incorrect axiom.

> but Intel isn't allowed to have issues in their processors, even if they can fix them with a software (microcode) update?

They could've caught and fixed this one with some more testing, before actually releasing. Formal verification isn't necessarily going to help, if it's a statistical type of defect --- thus the concept of wafer yield, and why dies manufactured from the exact same masks can behave completely differently with respect to the clock speeds attainable and voltages required.


Formal verification cannot be used to fully verify an entire Intel CPU, simply because the computational resources required to do so are astronomical. Heck, the cache coherence unit alone is basically impossible to verify.

Therefore, CPU designers verify the important units (e.g., the ALU) independently of each other, and then try to verify their interaction. But a system level design verification check is simply not possible.

Obviously, faults that result from fabrication can still take place, so they are tested for using BIST and JTAG. Test coverage can be pretty high, but obviously not 100%.

As you can see, there are still a ton of places where hardware errors can seep through.


Expectations of hardware are indeed higher than for software, but isn't it a valid trade-off for Intel to make as long as most of the issues can be fixed with a microcode update? On Windows, it's updated via Windows Update, and on most Linux distributions, a package exists for CPU microcode. Most users have an easy—if not automatic—way to obtain the updates.

I believe the main issue with formally verifying everything is that reasoning about parallel code is extremely hard and might well make the entire endeavour unattainable. It would be great to have formally verified CPUs, though.

> They could've caught and fixed this one with some more testing, before actually releasing.

Isn't this true of any bug?


Windows update applies CPU patches?!

I know they did AMT but that was a special case and different.


Seems like they do when it fixes serious bugs, a quick search yields e.g. https://support.microsoft.com/en-us/help/3064209/june-2015-i... ("This update fixes several issues with Intel CPUs that could cause computer crashes or functions incorrectly.")


If this "fix" ends up disabling HT, how can I get a refund not just for the CPU but for the $3k laptop I spec'd around it? Without needing to sue?


That's just the workaround until a fix is available for your CPU, if I understood correctly.


No, that's not what I meant. In the past, Intel's microcode updates have simply disabled the broken functionality as an official solution.

http://techreport.com/news/26911/errata-prompts-intel-to-dis...


You specifically said the fix might be to permanently disable HT.

The TSX situation was indeed unfortunate, I have one of the affected CPUs. But it was a new feature that was broken, not something that used to work, so the bug's impact was less serious, and disabling the feature didn't have too much of an impact.


Contact your laptop OEM.


Have you ever had to do this?

The Dell XPS 9350 (Skylake) has numerous issues with GPU noise, USB-C compatibility, wifi reliability - Dell don't care unless your consumer laws are strong enough to make them care.


No but dell had a 30 day full refund policy. Did those bugs not start until day 31?


TBH, everything was covered by warranty other than the coil whine and wifi - by sticking to one particular driver version you can avoid many of the wifi issues. Not everyone has had the USB-C problems and it's gotten a little better with the most recent updates.


Well, if a software bug leads to data corruption it is not really "allowed" as well. It is quite amazing that there are not more bugs like this. Intel spends an insane amount of money on very large teams of very smart people to work on these insanely complicated systems.

I don't think this is sustainable; I think the future is massive arrays of relatively simple processors. First, though, we will need a culture shift that teaches programmers to write concurrent programs from the start, and this will take a lot of time (because technology moves a lot faster than culture).


It would be nice if they allowed users to update rather than vendors. I have a mobile Xeon. I don't know if it's affected or not.


There are hardware bugs that are impossible to fix in microcode. What then?


You make workarounds in software, obviously.


...or recall all the hardware and replace it:

https://en.wikipedia.org/wiki/Pentium_FDIV_bug


There will always be. What then? We stop producing cpus and wait for some magical new technology that prevents all bugs?


It's why you can't treat hardware bugs like software ones and dismiss them flippantly.


It's funny, since many of today's software bugs are due to the legacy of C, faults of which can often be attributed to mimicking the behaviour of CPU hardware too closely.

EDIT: My wording was poor. What I meant is that hardware vendors do ship bugs caused by deficiencies in their designs, and those bugs are not only tolerated but often reflected in software. That runs counter to OP's suggestion that hardware vendors aren't allowed to have bugs while software vendors are:

> So it's okay for software to have bugs that get fixed (I think everybody here acknowledges that software will always have bugs), but Intel isn't allowed to have issues in their processors


Go back to rust-land now :D


It was meant as a response to OP basically saying that hardware vendors produce far fewer bugs than software vendors, which is true; but many software bugs can arguably be attributed to hardware design, so it's not like they're without fault. Rust has nothing to do with it; not sure why even bring it up.


I get it - sorry, meant to playfully Sunday afternoon troll you.

Ever since I moved from C to Java (I like low-level stuff, but the company I joined is a Java shop) I have been having one of three major problems: logic bugs; too many frameworks, causing bugs due to a lack of in-depth understanding of each of them; and GC bottlenecks. Of all of them, I hate the GC ones the most.


Formal verification of a decode unit simply will never happen. The complexity is far too high.

Formally verifying something like a multiplier block is difficult but doable if you care. Formally verifying an FPU is probably at about the limit.

If you want formal verification, you would have to simplify a modern microprocessor a lot.


That's the whole point of doing a provably correct transformation from a simpler model domain. Even complicated OOOE engines are fairly tractable if you don't try to implement them at an RTL level...

Perhaps we are imagining different formal verification methodologies. Can you tell me what kind of formal verification you're referring to?



