@psf I followed that through to the Google and FB papers; they're quite good reads, the FB one especially - that must have been a *fun* debug session. I suspect most of these are silicon-level problems rather than microarchitectural.

I remember we had a problem like this when running a few dozen Pentium 933 MHz machines: one CPU on one machine would give incorrect results for one particular run (out of thousands).

That would have been in 2001 or so.

And earlier than that, I remember people working at Sun saying they'd re-run any failing job (out of tens or hundreds of thousands) and only mark it as failed if the re-run (or possibly even the re-re-run) failed.

@penguin42 @psf

@EdS @psf First time I've heard of that on Pentiums; around the same time Sun did have a known problem with cache on one series of SPARCs; theregister.com/2001/03/07/sun

Having now read the fine article...

Our problem with the Pentium was probably a test escape: some flaw in the circuit, rarely but reliably triggered, and not covered in production test (or by other workloads)

Whereas this is less repeatable. Some machines fail, some of the time, while most do not. And the failures react to the environment.

Maybe today's very complex CPUs have more holes in test coverage. Tiny transistors and wires can be flawed in subtle ways.

@penguin42 @psf

My suspicion is that it's not missing test cases, but either degradation with age (I remember reading about Electromigration: en.wikipedia.org/wiki/Electrom), or timing/voltage dependencies - e.g. not being quite fast enough/strong enough at low voltages.
@EdS @psf

Could be some ageing effect. Clearly not one leading to an easily reproducible failure, though: it leads to something unlikely but possible.

I did read that Intel is cutting down on validation effort, in which case they are designing-in more bugs:
"CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort"

danluu.com/cpu-bugs/

Search within for "sheer panic"

@penguin42 @psf

@EdS @psf Yep, but actual bugs worry me in a different way - something that reliably does the wrong thing is a bit easier to think about; core/speed-specific ones are nasty, though, and ones specific to ageing/failing are even worse.

@theruran @psf Worth reading the original paper on defective CPUs. sigops.org/s/conferences/hotos

> A deterministic AES miscomputation, which was “self-inverting”: encrypting and decrypting on the same core yielded the identity function, but decryption elsewhere yielded gibberish.

@niconiconi @psf yeah - finding out that the only computer that can decrypt and access your data is the one with a #mercurialCore that originally encrypted it is 'fun.'
/cc @stman
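
Not from the paper, just a minimal sketch of why a "self-inverting" defect is so awkward to catch: a plain round-trip test (encrypt then decrypt on the same core) still passes, so you need a known-answer test against a published vector, run while pinned to each core in turn. This assumes Linux (for os.sched_setaffinity) and Python's `cryptography` package; the test vector is the standard AES-128 example from FIPS-197 Appendix C.1.

```python
# Sketch only: per-core AES known-answer test.
# A round-trip check (encrypt then decrypt) can pass on a "self-inverting"
# mercurial core, so compare the ciphertext against a published FIPS-197
# test vector instead. Assumes Linux and the 'cryptography' package.
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

KEY       = bytes.fromhex("000102030405060708090a0b0c0d0e0f")
PLAINTEXT = bytes.fromhex("00112233445566778899aabbccddeeff")
EXPECTED  = bytes.fromhex("69c4e0d86a7b0430d8cdb78070b4c55a")  # FIPS-197 C.1

def check_core(core: int) -> bool:
    os.sched_setaffinity(0, {core})  # pin this process to one core
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    ct = enc.update(PLAINTEXT) + enc.finalize()
    dec = Cipher(algorithms.AES(KEY), modes.ECB()).decryptor()
    rt = dec.update(ct) + dec.finalize()
    # rt == PLAINTEXT can hold even on a miscomputing core;
    # ct == EXPECTED cannot.
    return rt == PLAINTEXT and ct == EXPECTED

if __name__ == "__main__":
    for core in sorted(os.sched_getaffinity(0)):
        print(f"core {core}: {'ok' if check_core(core) else 'MISCOMPUTES'}")
```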

@stman @theruran @yaaps @niconiconi @psf

This reminds me of a story from a flatmate who repaired machines for a supplier that had a warehouse in North London.

Their workflow meant that desktop machines would arrive in the warehouse to be stored there, before being sent to the test lab where my flatmate worked.

They would test the machines, then send them back to a different part of the warehouse before the machines were sent on to the customers' offices.

@stman @theruran @yaaps @niconiconi @psf

Once the machines were in the offices, they would start crashing after three days.

So replacements were sent out, and the original machines would be sent back to the warehouse where they were stored until they could be tested again.

All of the tests came out fine, so the machines were sent back to the warehouse before being sent out to a different set of customers.

Rinse and repeat for several iterations, before they got serious about tracing the problem.

@stman @theruran @yaaps @niconiconi @psf

Eventually someone noticed that the machines that were failing had a common element.

They were using +/-10% rated-value resistors.

When they started testing the resistors, they found that ALL of the resistor values were either -10% to -5% of the rated value, or +5% to +10%.

NONE of them were in the centre bands.

@stman @theruran @yaaps @niconiconi @psf

That's when they worked out that the resistor manufacturer had been cherry-picking the resistors from the manufacturing process.

All of the most accurate resistors went into the +/-0.1% product line, the next batch went into the +/-2% product line, then +/-5%, and finally the +/-10% product line that my flatmate came across.

He ordered batches directly from a range of resistor suppliers.

ALL of the resistor manufacturers were doing it.
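
Illustrative only (made-up numbers, not my flatmate's measurements): if a factory skims everything inside +/-5% off into its tighter product lines, the +/-10% bin that's left has a hole in the middle of its distribution - which is exactly the giveaway the test lab spotted.

```python
# Toy simulation of tolerance binning with invented numbers.
# Parts within +/-5% of nominal get skimmed off for the tighter product
# lines, so the remaining "+/-10%" bin contains no centre-band parts at all.
import random

NOMINAL = 10_000  # nominal value, e.g. a 10 kilohm resistor
random.seed(1)

production = [random.gauss(NOMINAL, NOMINAL * 0.04) for _ in range(100_000)]

def deviation(value):
    return (value - NOMINAL) / NOMINAL

tighter_lines = [r for r in production if abs(deviation(r)) <= 0.05]         # sold as +/-5% or better
ten_pct_bin   = [r for r in production if 0.05 < abs(deviation(r)) <= 0.10]  # sold as +/-10%

centre_band = [r for r in ten_pct_bin if abs(deviation(r)) <= 0.05]
print(f"parts sold as +/-10%: {len(ten_pct_bin)}")
print(f"of those, within +/-5% of nominal: {len(centre_band)}")  # zero, by construction
```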

@stman @theruran @yaaps @niconiconi @psf

If you wanted an accurately-specced resistor, you had to buy the most expensive ones; otherwise you were left guessing whether the components would work on your circuit boards.

The reason the PCs worked in the test lab but not in the customers' offices was temperature: the warehouse was unheated, so the machines never got warm enough to fail, whereas the offices were at room temperature.

@stman @theruran @yaaps @niconiconi @psf

It wouldn't surprise me if the CPU manufacturers were doing the same.

Test the chips, sell the most accurate versions at the highest prices, and have a set of bands for the rest.

I know that Intel WAS doing this in the early 90s, but changed the way it did things after being sued by some banks that had spent a LOT of money buying math co-processors that failed if you pushed them too far.

@stman @theruran @yaaps @niconiconi @psf

Someone at the CPU manufacturer has fired the staff who knew about this failure mode, and there's been a corresponding loss of institutional memory.

The CPU manufacturer has been banding the chips to increase their margins by creating differential product lines.

Someone at the computer manufacturer has been trying to improve their margins by buying the cheaper chips.

Someone at Google/FB has been shaving their costs by buying cheaper machines.

@stman @theruran @yaaps @niconiconi @psf

But this time it's operating at the remote data centre level, rather than the desktop PC level.

Time to benchmark every chip that you buy, and sue the maker if it's not up to spec.

Also time to start shorting the CPU maker's stock, as Google/FB have enough cash to effectively sue without settling. :D

@BillySmith @stman @theruran @yaaps @psf I won't be surprised if I see a comprehensive CPU test suite or online monitoring tool from Google or Facebook on GitHub a few years from now.

@BillySmith @stman @theruran @yaaps @psf The resistor tolerance story is a classic of electronics folklore. Everyone eventually hears different variations of the same story (or, if you're unlucky, gets first-hand experience) after a few years in electronics.

@niconiconi @stman @theruran @yaaps @psf

My flatmate showed me the machines that he was working on, as well as the results from the component tests that he performed. :D

He got a pay rise out of that, while the idiot who had tried to cut quality was made redundant.

That whole company was shuttered two years later, as no-one trusted that brand, so they stopped buying their machines.

It may be folklore, but i saw it happen. :D

@BillySmith @stman @theruran @yaaps @psf > as no-one trusted that brand, so they stopped buying their machines.
There's a saying for this - "worst-case tolerances never add - but when they do, they are found in the best customer's machine." gunkies.org/wiki/Vonada%27s_En

@theruran @psf Very interesting. A Ghost in the Shell + I, Robot scenario. By the way, I'm not even surprised.

The current cyberspace as a whole is definitely chaotic, and those processors and motherboards are far from being deterministic. It's just another confirmation.

@psf As I always say, it's a miracle to me that computers even work. :D

@psf Nothing changes - Pentium FP bug, anyone?
