The absolute state of CPU design in 2021: https://www.theregister.com/2021/06/04/google_chip_flaws/
@psf I followed that through to the Google and FB papers; they're quite good reads; the FB one especially - that must have been a *fun* debug session. I'm suspecting most of these are silicon level problems rather than microarchitectural.
I remember we had a problem like this when running a few dozen Pentium 933MHz machines: one CPU on one machine would give incorrect results for one particular run (out of thousands).
That would be in 2001 or so.
And earlier than that, I remember people working at Sun saying they'd re-run any failing job (out of tens or hundreds of thousands) and only mark it as failed if the re-run (or possibly even the re-re-run) failed.
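That re-run policy amounts to a simple retry wrapper. A minimal sketch (the function names and retry count are my own illustration, not Sun's actual tooling):

```python
def run_with_retries(job, max_attempts=3):
    """Mark a job as failed only if it fails on every attempt.

    One-off hardware flakiness (a single bad run out of thousands)
    is absorbed by the retries; only repeatable failures get reported.
    """
    for attempt in range(max_attempts):
        ok, result = job()
        if ok:
            return ("passed", result, attempt + 1)
    return ("failed", None, max_attempts)

# Toy job that fails once, then succeeds -- a stand-in for a
# machine that miscomputes on one run out of many.
attempts = {"n": 0}
def flaky_job():
    attempts["n"] += 1
    return (attempts["n"] > 1, 42)

print(run_with_retries(flaky_job))  # ('passed', 42, 2)
```

The obvious cost is that a genuinely marginal machine gets several chances to slip a wrong answer through as a "pass", which is exactly the failure mode the Google and FB papers are about.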
@EdS @psf First time I've heard of that on Pentiums; around the same time Sun did have a known problem with cache on one series of SPARCs; https://www.theregister.com/2001/03/07/sun_suffers_ultrasparc_ii_cache/
Having now read the fine article...
Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered in production test (or by other workloads)
Whereas this is less repeatable. Some machines fail, some of the time, while most do not. And the failures react to the environment.
Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.
Could be some ageing effect. Clearly not one leading to an easily reproducible failure, though: something unlikely but possible.
I did read that Intel is cutting down on validation effort, in which case they are designing-in more bugs:
"CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort"
Search within for "sheer panic"
@theruran @psf Worth reading the original paper on defective CPUs. https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf
> A deterministic AES miscomputation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.
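A toy model of that failure mode (my own illustration with a trivial XOR "cipher", not the paper's actual AES fault): suppose the defective core always computes with a corrupted key. Its own encrypt-then-decrypt round trip is still the identity, so a same-core self-test passes, but decrypting its output with the correct key on a healthy core yields gibberish.

```python
KEY = 0x5A
FAULT = 0x08  # hypothetical stuck bit in the defective core's key register

def xor_cipher(byte, key):
    # XOR is its own inverse: encrypt and decrypt are the same operation.
    return byte ^ key

plaintext = 0x41  # 'A'

# Defective core: both directions use the same corrupted key.
bad_key = KEY ^ FAULT
ct = xor_cipher(plaintext, bad_key)
assert xor_cipher(ct, bad_key) == plaintext  # same-core round trip: identity

# Healthy core decrypts with the correct key and gets garbage.
elsewhere = xor_cipher(ct, KEY)
assert elsewhere != plaintext
print(hex(elsewhere))  # 0x49, not the original 0x41
```

The moral: any test that checks a computation against its inverse on the same core can be blind to a deterministic per-core fault.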
This reminds me of a story from a flatmate who repaired machines for a supplier with a warehouse in North London.
Their workflow meant that a desktop machine would arrive in the warehouse and be stored there before being sent to the test lab where my flatmate worked.
They would test the machines, then send them back to a different part of the warehouse before the machines went out to the customers' offices.
When the machines were in the offices they would start crashing after 3 days.
So replacements were sent out, and the original machines would be sent back to the warehouse where they were stored until they could be tested again.
All of the tests came out fine, so they were sent back to the warehouse before being sent out to a different set of customers.
Rinse and repeat for several iterations, before they got serious about tracing the problem.
Eventually someone noticed that the machines that were failing had a common element.
They were using resistors with a +/-10% tolerance rating.
When they started testing the resistors, they found that ALL of them measured either -10% to -5% of rated value, or +5% to +10% of rated value.
NONE of them were in the centre bands.
That's when they worked out that the resistor manufacturer had been cherry-picking resistors off the production line.
All of the most accurate parts went into the +/-0.1% product line, the next batch went into the +/-2% product line, then +/-5%, and finally the +/-10% product line that my flatmate came across.
He ordered batches directly from a range of resistor suppliers.
ALL of the resistor manufacturers were doing it.
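The binning effect is easy to simulate (a hypothetical sketch; the nominal value, spread, and bin edges are my assumptions): draw parts spread around the nominal value, skim off everything that qualifies for a tighter tolerance band, and the leftover "+/-10%" stock ends up bimodal, with nothing near nominal.

```python
import random

random.seed(1)
NOMINAL = 1000.0  # ohms, hypothetical part

# Production spread: roughly normal around the nominal value.
parts = [random.gauss(NOMINAL, NOMINAL * 0.05) for _ in range(10_000)]

def tolerance(value):
    return abs(value - NOMINAL) / NOMINAL

# Cherry-pick the tighter bins first, as the story describes.
bin_01 = [p for p in parts if tolerance(p) <= 0.001]
bin_2  = [p for p in parts if 0.001 < tolerance(p) <= 0.02]
bin_5  = [p for p in parts if 0.02 < tolerance(p) <= 0.05]
bin_10 = [p for p in parts if 0.05 < tolerance(p) <= 0.10]

# Everything left in the "10%" bin is at least 5% off nominal,
# matching the measurements in the story: a hole in the middle.
assert all(tolerance(p) > 0.05 for p in bin_10)
print(len(bin_10), min(tolerance(p) for p in bin_10))
```

So a circuit designed on the assumption that +/-10% parts cluster around nominal is actually built entirely from the worst 5-10% outliers.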
If you wanted an accurately specced resistor, you had to buy the most expensive ones; otherwise you were left guessing whether the components would work on your circuit boards.
The reason the PCs worked in the test lab but not in the customers' offices is that they never warmed up enough to fail: the warehouse was unheated, while the offices were at room temperature.
It wouldn't surprise me if the CPU manufacturers were doing the same.
Test the chips, sell the most accurate versions at the highest prices, and have a set of bands for the rest.
I know that Intel WAS doing this in the early '90s, but changed the way it did things after being sued by some banks that had spent a LOT of money on math co-processors that failed if you pushed them too far.
Someone at the CPU manufacturer has fired the staff who knew about this failure mode, and there's been a corresponding loss of institutional memory.
The CPU manufacturer has been banding the chips to increase their margins by creating differential product lines.
Someone at the computer manufacturer has been trying to improve their margins by buying the cheaper chips.
Someone at Google/FB has been shaving their costs by buying cheaper machines.
But this time it's operating at the remote data centre level, rather than the desktop PC level.
Time to benchmark every chip that you buy, and sue the maker if it's not up to spec.
Also time to start shorting the CPU maker's stock, as Google/FB have enough cash to effectively sue without settling. :D
My flatmate showed me the machines that he was working on, as well as showing me the results from the component tests that he performed. :D
He got a pay-raise from that, while the idiot who tried to cut the quality was made redundant.
That whole company was shuttered two years later, as no-one trusted the brand any more and people stopped buying their machines.
It may be folklore, but i saw it happen. :D