
  The issue was being investigated by the OCaml community since
  2017-01-06, with reports of malfunctions going at least as far back as
  Q2 2016.  It was narrowed down to Skylake with hyper-threading, which is
  a strong indicative of a processor defect.  Intel was contacted about
  it, but did not provide further feedback as far as we know.
 
  Fast-forward a few months, and Mark Shinwell noticed the mention of a
  possible fix for a microcode defect with unknown hit-ratio in the
  intel-microcode package changelog.  He matched it to the issues the
  OCaml community were observing, verified that the microcode fix indeed
  solved the OCaml issue, and contacted the Debian maintainer about it.
 
  Apparently, Intel had indeed found the issue, *documented it* (see
  below) and *fixed it*.  There was no direct feedback to the OCaml
  people, so they only found about it later.

Inexcusable.


They forgot to follow up on a support ticket. As your quotation mentions, the issue was documented and fixed. Calling that "inexcusable" is a bit strong, don't you think?

I'm not a particularly big fan of Intel's practices, but the reactions in this thread seem a bit too strong to me.


It wouldn't be a big deal if this was just software, but the fact that Intel allowed a PROCESSOR bug to be reported, tested, and fixed without telling anyone that the bug actually exists is honestly horrible. You can't just sweep CPU bugs under the rug, since they can throw the stability and reliability of the entire system into question. The people who reported it shouldn't have to dig through Intel microcode updates and test different fixes to see if the bug they found was fixed. Hardware manufacturers (especially processor manufacturers) need to be held to a higher standard when it comes to bug reporting, and this kind of behavior really has no excuse.


> Intel allowed a PROCESSOR bug to be reported, tested, and fixed without telling anyone that the bug actually exists

That's just not true. Intel published the erratum in April: https://www3.intel.com/content/dam/www/public/us/en/document... - search for SKL150. It was also clearly noted in Debian's intel-microcode changelog on May 15: http://metadata.ftp-master.debian.org/changelogs/non-free/i/...

They absolutely should have followed up on the OCaml people's support ticket. But sloppy follow-up is an issue that every large project encounters.


It's an excuse that is commonly given - but easily avoidable if you care enough about the ones that pay your salary by buying processors from your company.

And not only that, but they did much more than the average Joe and essentially pinpointed the issue for them. So yeah, inexcusable it is.


Do we have any evidence that that's where Intel even learned about the bug?


I agree. It's totally possible that Intel found the bug in another bug report, fixed it, closed that bug report, and never realized that the OCaml bug was related.


What exactly do you find inexcusable here?


I don't mind Intel keeping very quiet about fixed-in-microcode bugs that don't directly affect valid userspace programs, like

« Instruction Fetch May Cause Machine Check if Page Size and Memory Type Was Changed Without Invalidation »

or

« Execution of VAESIMC or VAESKEYGENASSIST With An Illegal Value for VEX.vvvv May Produce a #NM Exception »

but something like this should be announced clearly.

(I keep my microcode packages up to date, but I don't normally bother rebooting when an update comes in.)


It was reported to Intel, who fixed it and did not reply to the reporter.


The contempt for users. I know what I do when a user files a real bug: respond to them, acknowledge it's a problem, tell them when it's fixed.

The fact that Intel does not do that with a bug of this magnitude shows how much respect they have for their users.


It's totally plausible that Intel detected this bug independently with their own verification effort or through another customer. Matching different defect reports when "unexplained" or nondeterministic behavior is the expected result can be challenging.


'Contempt' is far too dramatic of a word. I think what you mean is 'indifference'.


>Inexcusable

What, the "found the issue, documented it, and fixed it" part?


We had been seeing this one on and off on some of our machines, and were already at least mentally pointing the finger at the LWT library. Turns out these machines were affected. That's one less worry.



