Petroski Notes

Random bits about the book "To Forgive Design" (2012) by Henry Petroski, and related ideas. I write notes like this on my wiki so I can mine the text for my own later writings (and people can correct me before I do).


Book Notes

No complaints - just notes to help me remember what I found in the book, or what occurred to me while I read it.


Reliability and failure in chip design

Electronic product design accumulates failure experience more rapidly but more inconsistently than structural engineering. In a throwaway culture, a product is often discarded before it has a chance to fail. Exceptions include instrumentation (see Vintage Tektronix below), ships and planes, automotive, and communications central plant electronics, but even here the replacement cycles are rarely more than 20 years, and are also driven by obsolescence. Usually, the electronics are more reliable than the software running on them, so errors in software receive more attention (properly so, though with less success).

Electronics for implantable medical products (pacemakers, etc.) must have super high reliability, but medical volumes are small and the duration of use can be as short as hours (for catheterized imaging sensors, for example). The very high liability exposure often means that mainstream manufacturers actively avoid these markets - for example, Motorola sold piezo sensors for land mines, but refused to sell them for defibrillating pacemakers. Electronics for satellites require high reliability, but direct forensic analysis of failures is usually impossible; these failures are often inferred via telemetry.

Electronic systems are cheap and abundant enough to permit sampling and test to destruction - components and whole systems. Heat accelerates the chemical and thermal expansion processes that drive most failures, so electronics are tested with temperature cycling (as much as -50C to 150C) and month-duration high temperature testing ("life test") of hundreds of samples. As a rule of thumb, the failure rate doubles for every increase of 10C, so a product baked at 150C for a month emulates a product operating at 50C for roughly 85 years (ten doublings, or about 1000 months). These temperatures destroy or warp plastics, liquid crystal displays, etc., so forensics can be incomplete.
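The doubling rule is easy to check with a few lines of arithmetic. This is only a sketch of the rule of thumb quoted above, not a real reliability model (a proper analysis would use measured activation energies); the function names are made up for illustration.

```python
def acceleration_factor(stress_c, use_c, doubling_step_c=10.0):
    """Rule of thumb: failure rate doubles for every doubling_step_c degrees C."""
    return 2.0 ** ((stress_c - use_c) / doubling_step_c)


def equivalent_years(test_months, stress_c, use_c):
    """Field time (in years) emulated by a bake of test_months at stress_c."""
    return test_months * acceleration_factor(stress_c, use_c) / 12.0


af = acceleration_factor(150, 50)       # ten doublings: 2**10
years = equivalent_years(1, 150, 50)    # one month of bake, in field years
print(af, round(years, 1))              # 1024.0 85.3
```

A smaller temperature gap shrinks the factor fast: baking at 100C instead of 150C gives only 32x, so a month of test emulates less than three years at 50C.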

Sadly, some manufacturers of consumer electronics perform no testing at all, beyond making a few prototypes out of one batch of prototype parts and not finding egregious failures. We sometimes ironically call this cruelty-free - there's no testing. Like the animal non-testing equivalent, the customers become the guinea pigs. OTOH, companies like Apple perform hundreds of tests during product assembly in their Chinese manufacturing plants. Finding defects early, and correcting processes quickly, increases product quality and reduces scrap rate.

Some products are designed with built-in telemetry, which makes factory test easier and is also very helpful for post-failure forensics. The IEEE 1149 family of test standards reduces the number of (unreliable) contact probes necessary to measure a product during manufacture. My company SiidTech designs and licenses identification circuits that are built into the chips, allowing failed chips to be compared to saved individual testing data on the wafer. Our clients use this to improve their tests (reducing future failures), detect otherwise anonymous chips whose parameters change during manufacturing (they will probably change past failure in the field), and even discard assembled chips that were neighbors to failed chips on the wafer (the neighbors tend to fail faster).
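The neighbor-culling idea can be sketched in a few lines, assuming each chip's identification maps back to an (x, y) die position on its wafer. This is a hypothetical illustration, not SiidTech's or any client's actual procedure; the function name, the coordinates, and the one-die radius are all made up.

```python
def neighbors_of_failures(failed, passed, radius=1):
    """Return passing dice within `radius` grid steps of any failed die.

    `failed` and `passed` are sets of (x, y) die coordinates on one wafer.
    """
    suspects = set()
    for (fx, fy) in failed:
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                candidate = (fx + dx, fy + dy)
                if candidate in passed:
                    suspects.add(candidate)
    return suspects


failed = {(3, 4)}                              # one die that failed in the field
passed = {(2, 4), (4, 4), (3, 5), (7, 7)}      # dice that passed wafer test
print(sorted(neighbors_of_failures(failed, passed)))
# [(2, 4), (3, 5), (4, 4)]
```

The die at (7, 7) is far from the failure and stays in the shipped population; the three adjacent dice are flagged for discard or extra screening.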

The greatest value of failure data is process improvement. As in all engineering disciplines, failed components teach us where manufacturing is inadequate. Non-failures teach us to be more aggressive, reducing costs and pushing the performance of future components. Cell phones and microprocessors are pushed to the limits - a cell phone can be more power efficient if smaller power amplifier transistors are used with higher voltages and temperatures. They degrade faster in the field, but most consumers lose, break, or replace their cell phones before this happens. Computer CPUs are stressed similarly; customers demand performance, and old computers grow uselessly obsolete, so computers are optimized for maximum performance over 5 year lifetimes. Hobbyist "overclockers" eke out a few percent more speed while reducing lifetimes to months. The heat and increased current push metal atoms in conductors and create cracks and voids - "electromigration".

Electronics have become so reliable that politicians push the other way. In order to prevent lead in landfills, RoHS (Restriction of Hazardous Substances, "row-hoss") rules forbid lead in solder, moving to other "non-eutectic" tin alloys. These alloys do not flow like solder, and have high stress at grain boundaries in the metal. The material relieves stress by forming "whiskers" (long, spindly crystals) that reach out from solder joints, shorting to neighboring connections. These have caused failures in aircraft and satellites - crashes and derelict satellites don't add e-waste to landfills, either, though perhaps they add an occasional pilot to a cemetery.

Perhaps the politicians are also increasing death tolls for structural engineering, with aesthetic and construction "convenience" demands. Airport noise reduction and fuel economy change how aircraft engines and airframes are designed; the Rolls-Royce engines for Boeing's 787 Dreamliner are engineering marvels, but they are pushing into new territory. International politics demand that components are made in China, Japan, and elsewhere, often by inexperienced manufacturers. The Dreamliner's carbon fiber wing roots turned out much weaker than planned, delaying deployment of the plane for years (fortunately, during an order-delaying recession). Boeing may pull all this off without increased failures in the field; they are very good engineers. But Airbus must push performance even more than Boeing to re-capture market. As these two giants vie for market leadership, we may learn we have pushed too hard as these planes age, fail, and passengers die.


Abusing statistical distributions

Most phenomena don't fit Gaussian curves. Measuring a few samples, computing a mean and deviation, then extrapolating to large deviations drastically underestimates the probabilities at those extremes. Almost all real measured distributions from many samples have fat tails (excess kurtosis): much higher probabilities at large sigma. On some of my circuits, I've built over one million samples (thousands of circuits per integrated circuit die) to characterize the extremes, verifying that there were many larger deviations than a naive bell curve would suggest. The usual case is that there are common variations which sum statistically, as well as rare defects that add a large amount of variation to a small subset of samples.
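The "common variation plus rare defects" picture can be illustrated with a two-component mixture. This is a toy sketch with made-up parameters (one sample in a thousand gets five times the normal spread), not measured data from any circuit; it only shows how little contamination it takes to fatten the tails.

```python
import math


def gaussian_tail(k, sigma=1.0):
    """Two-sided probability that a zero-mean Gaussian exceeds k in magnitude."""
    return math.erfc(k / (sigma * math.sqrt(2.0)))


def mixture_tail(k, defect_fraction=0.001, defect_sigma=5.0):
    """Same tail for mostly unit-sigma samples plus a rare wide-sigma subset."""
    good = (1.0 - defect_fraction) * gaussian_tail(k, 1.0)
    bad = defect_fraction * gaussian_tail(k, defect_sigma)
    return good + bad


for k in (3, 5, 6):
    naive = gaussian_tail(k)
    actual = mixture_tail(k)
    print(k, naive, actual, actual / naive)
```

With these assumed numbers the defects barely change the measured standard deviation (the variance rises by about 2%), yet the probability of a 5 sigma outlier comes out hundreds of times larger than the bell curve predicts. A few dozen samples would almost never reveal the difference.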

Nassim Nicholas Taleb's book "The Black Swan" describes this as the occasional event that falls way outside normal variation, in finance and in life. Between 1920 and 1975, his home country of Lebanon was peaceful, prosperous, and ethnically diverse. In 1975, civil war broke out and some of his high school friends were now trying to kill him. This formed his later investment (and life) strategy - hide and read, don't assume business as usual, prepare for major catastrophes. The book is a charming romp through the inevitability of the unexpected - such as the black swans used by European philosophers as examples of logical absurdities - until they were discovered in Australia.

Bart Kosko's book "Noise" is an electrical engineering take on the same issues - this time signals on a wire, or variations in parameters when measuring batches of parts. The outliers are always there. Even more mundane behaviors like Poisson distributions show that if one event happens with small probability over a short time, there is a significant probability of many events happening during a similar short time sometime in the future. For example, if there is a 10% probability of a piece of debris running into your space elevator every orbit, it seems like you can dodge it. But after 20 years and 100,000 orbits, there will be about 17 orbits where you have to dodge 3 or more debris pieces during the same orbit. Dodging 3 objects nearly simultaneously may be beyond the capacity of your space elevator. This Poisson "many things going wrong at once" behavior is characteristic of complicated systems running continuously for a long time. If your system can't be allowed to fail, you need a lot of spare capacity to deal with more than one problem at a time.
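The debris estimate can be reproduced with the Poisson formula directly. The sketch below assumes independent debris arrivals and tries two readings of "10% per orbit" (a mean of 0.1 pieces per orbit, or a 10% chance of at least one piece); both readings are my assumptions, and both land in the same range as the figure above.

```python
import math


def poisson_tail(lam, k_min):
    """P(N >= k_min) for a Poisson-distributed count with mean lam."""
    below = sum(math.exp(-lam) * lam**k / math.factorial(k) for k in range(k_min))
    return 1.0 - below


orbits = 100_000
lam_mean = 0.1                    # reading "10%" as the mean count per orbit
lam_hit = -math.log(1.0 - 0.1)    # reading "10%" as P(at least one) per orbit

for lam in (lam_mean, lam_hit):
    expected = orbits * poisson_tail(lam, 3)
    print(round(lam, 4), round(expected, 1))
```

The two readings give roughly 15 and 18 orbits with three or more pieces. Asking for four or more in one orbit still leaves a fraction of an orbit expected over the same period, so the spare capacity question does not go away by designing for three.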


Vintage Tektronix oscilloscopes

From the 1950s through the 1970s, Tektronix designed oscilloscopes to last a very long time. Some have, and are displayed at the Vintage Tek museum, between Beaverton and Portland. Normal museum hours are Friday and Saturday, but I imagine that special hours could be arranged for visiting dignitaries. During normal hours, many of the retired engineers who originally designed this equipment can be found in the back room, repairing units for display. A don't-miss opportunity for engineering historians.

The very durability of vintage Tektronix equipment means that there is a lot of it on the surplus market, and mouldering in instrument storage at hundreds of companies. One of my dreams is that universities could bring these old instruments and old engineers to campus for a two week "gross anatomy" course for seniors or grad students, taking these old instruments apart while explaining the engineering decisions that went into them.

I sometimes run a junior version of this, bringing a few hundred pounds of old electronics and a dozen sets of tools and safety goggles to a weekend academy setting, letting kids from 6 to 12 take stuff apart and see how it is put together (6 year olds love taking apart deskset telephones). I've had the dubious honor of watching a mom drag her daughter away kicking and screaming - the girl wanted to take stuff apart all night. Most kids don't care about how things are made. But the future engineers care a Whole Lot. Supplying them with scraps to take apart and rebuild (and teaching them to do it semi-safely) should be a priority for our profession. There is a lot of scrap wood that should be going to the future structural engineers out there - how can we foster their projects while keeping the kids out of the emergency room, and ourselves out of court?


Robert Courland's "Concrete Planet"

I would love to see a review by a concrete engineer - this book seemed a little "selective" about concrete failures. Book notes here


Cathodic protection of rebar

Concrete Planet makes much of the rusting of rebar in old concrete structures, and the trillions of dollars that may be necessary to replace thousands of them in the near future. I don't know how real the problem is. If it is indeed a serious problem, we ought to be looking for ways to slow or stop this decay.

Rebar iron won't oxidize nearly as fast if it is biased cathodically. Too much voltage, and the electrode will emit hydrogen, which can embrittle the iron. Without knowing the voltage drop through the concrete, it is difficult to bias the iron properly.

A proposed invention

I'm not interested in patenting the following - consider it public domain. If it makes sense, please use it to protect America's aging reinforced concrete, saving lives and tax dollars. Perhaps a joint project between the CE and ME departments at Duke could get some papers out of it. I can imagine this turning into a little circuit board with a solar cell, a battery, and a small integrated circuit with a simple Bluetooth radio transceiver for status logging and bias calibration, mass produced and added by the millions to reinforced concrete structures.


If you live in a place like Oregon with lots of rain and humidity, and have cheap telephone jacks with inadequate gold plating, you've listened to anodic oxidation of copper - it makes a hissing noise. These may be the little electrochemical events of copper ions oxidizing, small voltage spikes adding up to electrical noise.

The current drawn from a cathodic bias circuit on embedded rebar may make a similar noise, if the voltage is inadequate to prevent oxidation events. If the current is detected with a "virtual ground" amplifier, the capacitance of a few hundred feet of rebar can be nulled, and the audio band noise measured. Of course, the circuit will need a lot of filtering to reject radio noise and other antenna pickup.

I also presume that hydrogen generation is a higher energy phenomenon, and has a different noise spectrum. So proper bias on a rebar electrode will be between the "copper oxidation hiss" and the "hydrogen hiss". It would be instructive to digitize the electronic spectrum of a piece of rebar in concrete as a function of voltage, and see if these hisses can indeed be distinguished, and a bias point chosen for the optimum tradeoff between hydrogen and oxidation.

If the voltage oscillates, we might learn about the migration of evolved hydrogen. Will it react with the iron oxide, reducing it back to iron and water? Can we learn about the depletion of alkalinity in the concrete around the rebar? We may learn to coat the rebar with materials that work in conjunction with cathodic protection to improve survival still further. These processes could be emulated at small scale in the materials lab, perhaps combined with high temperature lifetest acceleration.

I do not know how well this would work for large expanses of interconnected rebar - chances are there will be DC voltage gradients across the structure, so some portions of the electrode will be oxidizing while others will be reducing. While this technique might still help prolong the lifetime of existing reinforced concrete, it will probably work best on rebar electrodes carefully designed to match areas of similar ambient electrical potential and long term water infiltration and pH changes. So new designs will eventually get finite element electric field analysis, as well as mechanical analysis.

It may also be possible to find patches of damaged rebar by using large electrode plates on the surface of the structure to change the electric fields, then look at spectrum changes. Perhaps external patch electrodes could be added to parts of an existing structure's surface to remediate or stop damage. Ugly, but not as ugly as a bridge collapsed into a river.

PetroskiNotes (last edited 2012-08-03 15:26:12 by KeithLofstrom)