The issue was being investigated by the OCaml community since
2017-01-06, with reports of malfunctions going at least as far back as
Q2 2016. It was narrowed down to Skylake with hyper-threading, which is
a strong indicative of a processor defect. Intel was contacted about
it, but did not provide further feedback as far as we know.
Fast-forward a few months, and Mark Shinwell noticed the mention of a
possible fix for a microcode defect with unknown hit-ratio in the
intel-microcode package changelog. He matched it to the issues the
OCaml community were observing, verified that the microcode fix indeed
solved the OCaml issue, and contacted the Debian maintainer about it.
Apparently, Intel had indeed found the issue, *documented it* (see
below) and *fixed it*. There was no direct feedback to the OCaml
people, so they only found about it later.
They forgot to follow up on a support ticket. As your quotation mentions, the issue was documented and fixed. Calling that "inexcusable" is a bit strong, don't you think?
I'm not a particularly big fan of Intel's practices, but the reactions in this thread seem a bit too strong to me.
It wouldn't be a big deal if this was just software, but the fact that Intel allowed a PROCESSOR bug to be reported, tested, and fixed without telling anyone that the bug actually exists is honestly horrible. You can't just let CPU bugs under the run since it can throw the stability and reliability of the entire system into question. The people that reported it shouldn't have to dig through Intel microcode updates and test different fixes to see if the bug they found was fixed. Hardware manufacturers (especially processor manufacturers) need to be held to a higher standard when it comes to bug reporting and this kind of behavior really has no excuse.
It's an excuse that is commonly given - but easily avoidable if you care enough about the ones that pay your salary by buying processors from your company.
And not only that, but they did much more than the average Joe and essentially pin-pointed the issue for them. So yeah, inexcusable it is.
I agree. It's totally possible that Intel found the bug in another bug report, fixed it, closed that bug report, and never realized that the OCaml bug was related.
It's totally plausible that Intel detected this bug independently with their own verification effort or through another customer. Matching different defect reports when "unexplained" or nondeterministic behavior is the expected result can be challenging.
We had been seeing this one on and off on some of our machines, and were already at least mentally pointing the finger to the LWT library. Turns out these machines were affected. That's one less worry.