Thursday, March 19, 2015

tandem computers!!!

Interesting bits from “Why Do Computers Stop and What Can Be Done About It?”

Why Do Computers Stop and What Can Be Done About It? is a technical report by Jim Gray from Tandem Computers, written in 1985. It’s what I’d call “an oldie but a goldie”. Lots of insights from production systems, very likely still applicable today, but mostly forgotten and ignored.

The paper looks at what Tandem Computers did right when it came to providing high availability systems (both hardware and software) in real-world settings. The paper opens with a rather unsurprising, but still interesting, finding:

An analysis of the failure statistics of a commercially available fault-tolerant system shows that administration and software are the major contributors to failure.

More damning numbers are brought forth in the introduction:

Conventional well-managed transaction processing systems fail about once every two weeks

Which is quite a lot, considering they report an average outage time of 90 minutes.

Reliability, Availability, and other Definitions

The following terms are defined in the report:

Availability is doing the right thing within the specified response time.

Reliability is not doing the wrong thing.

Expected reliability is proportional to the Mean Time Between Failures (MTBF).

A failure has some Mean Time To Repair (MTTR).

Availability can be expressed as a probability that the system will be available: Availability = MTBF / (MTBF+MTTR).

Note that availability is considered to account for partial failures too:

In distributed systems, some parts may be available while others are not. In these situations, one weights the availability of all the devices (e.g. if 90% of the database is available to 90% of the terminals, then the system is .9x.9 = 81% available.)
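These definitions are easy to check numerically. Here is a minimal sketch (the two-week MTBF and 90-minute MTTR figures are the ones quoted above; the weighting example is the database/terminals one from the report):

```python
def availability(mtbf_hours, mttr_hours):
    """Probability the system is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Conventional system: fails every two weeks, 90-minute outages.
single = availability(mtbf_hours=14 * 24, mttr_hours=1.5)
print(f"{single:.4f}")  # 0.9956 -- roughly "two nines"

# Partial failure: weight the availability of the parts.
# 90% of the database available to 90% of the terminals:
print(f"{0.9 * 0.9:.2f}")  # 0.81
```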

Hardware Availability

The report mentions that hardware availability is much simpler to obtain (or, if not simpler, at least done better in the wild):

The key to providing high availability is to modularize the system so that modules are the unit of failure and replacement. Spare modules are configured to give the appearance of instantaneous repair: if MTTR is tiny, then the failure is “seen” as a delay rather than a failure.

They give the example of ATMs frequently being placed in pairs in some areas, so that if one of them fails, people can still deposit or withdraw money at that location. The service is degraded, but it’s still working and available. This gives time for the support teams to fix the hardware without interrupting service:

modularity and redundancy allows one module of the system to fail without affecting the availability of the system as a whole because redundancy leads to small MTTR. This combination of modularity and redundancy is the key to providing continuous service even if some components fail. […] modern computer systems are constructed in a modular fashion: a failure within a module only affects that module. In addition each module is constructed to be fail-fast — the module either functions properly or stops. Combining redundancy with modularity allows one to use a redundancy of two rather than 20,000.

More concisely, the following list is given:

Hierarchically decompose the system into modules.

Design the modules to have MTBF in excess of a year.

Make each module fail-fast — either it does the right thing or stops.

Detect module faults promptly by having the module signal failure or by requiring it to periodically send an I AM ALIVE message or reset a watchdog timer.

Configure extra modules which can pick up the load of failed modules. Takeover time, including the detection of the module failure, should be seconds. This gives an apparent module MTBF measured in millennia.
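The fault-detection step in that list can be sketched with a simple watchdog: the module periodically sends an I AM ALIVE message, and a monitor declares it failed once the message stops arriving. All names here are illustrative, not from the paper:

```python
import time

class Watchdog:
    """Declares a module dead if it misses its heartbeat deadline."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_beat = time.monotonic()

    def i_am_alive(self):
        """Called by the monitored module to reset the timer."""
        self.last_beat = time.monotonic()

    def module_failed(self):
        """True once the module has been silent past the deadline."""
        return time.monotonic() - self.last_beat > self.timeout_s

dog = Watchdog(timeout_s=0.05)
dog.i_am_alive()
assert not dog.module_failed()   # heartbeat just arrived
time.sleep(0.1)                  # module goes silent...
assert dog.module_failed()       # ...and is promptly declared failed
```

Low timeouts give the prompt detection the paper asks for; a spare module can then take over within seconds.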

Software is the problem

The statistics are the same for the Tandem systems (redundant, somewhat distributed) as for the biggest source of computing power back in the day, mainframes:

Administration and software dominate, hardware and environment are minor to total system outages.

This is taken from over 10,000,000 system hours spread over 2,000 systems, or over 1,300 system years.

Roughly 30% of the faults reported are related to new code or hardware being introduced:

About one third of the failures were “infant mortality” failures — a product having a recurring problem. All these fault clusters are related to a new software or hardware product still having the bugs shaken out.

42% of the time is lost due to sysadmin tasks:

System administration, which includes operator actions, system configuration, and system maintenance was the main source of failures — 42%. Software and hardware maintenance was the largest category.

However, Lispers and Erlangers can start being smug: live maintenance is great!

High availability systems allow users to add software and hardware and to do preventative maintenance while the system is operating. By and large, online maintenance works VERY well. It extends system availability by two orders of magnitude.

This makes it unsurprising that humans are still one of the major sources of failures: something going wrong during maintenance. What’s more interesting is that the administrators at Tandem were really solid people:

But occasionally, once every 52 years by my figures, something goes wrong. […] The notion that mere humans make a single critical mistake every few decades amazed me — clearly these people are very careful and the design tolerates some human faults.

The following sources of downtime are reported:

Administration: 42% (31 years)
    Maintenance: 25%
    Operations: 9% (likely under-reported)
    Configuration: 8%

Software: 25% (50 years)
    Vendor: 21%
    Application: 4% (likely under-reported)

Hardware: 18% (73 years)
    Central: 1%
    Disc: 7%
    Tape: 2%
    Comm Controllers: 6%
    Power Supply: 2%

Environment: 14% (87 years)
    Power: 9% (likely under-reported)
    Communications: 3%
    Facilities: 2%

Unknown: 3%

For a total of 103%, or 11 system-years between each failure. The author suspects that most software errors are also under-reported, as noted in the list above:

I guess that only 30% are reported. If that is true, application programs contribute 12% to outages and software rises to 30% of the total.

Not looking good for developers.

How to fix it and get high availability

The implications of these statistics are clear: the key to high-availability is tolerating operations and software faults.

To make it short, your systems should be built in a way that makes it hard for operators to break things accidentally (don’t trust the operator), and you should expect software to have bugs. You shouldn’t rely on preventing them (though you should still try); instead, you should aim to keep functioning even in the presence of bugs.

Fixing things for operators

[…] reduce administrative mistakes by making self-configured systems with minimal maintenance and minimal operator interaction. […] Maintenance interfaces must be simplified.

On top of this, the report notes that two major sources of outages are:

Installing and deploying new software

Not deploying software fixes for known bugs.

This kind of paradox is messy, but in the end, the following recommendation is made:

Software fixes outnumber hardware fixes by several orders of magnitude. […] [install] a software fix only if the bug is causing outages. Otherwise, [wait] for a major software release, and carefully test it in the target environment prior to installation. […] if availability is a major goal, then avoid products which are immature and still suffering infant mortality. It is fine to be on the leading edge of technology, but avoid the bleeding edge.

Fixing things for software

The keys to this software fault-tolerance are:

Software modularity through processes and messages.

Fault containment through fail-fast software modules.

Process-pairs to tolerate hardware and transient software faults.

Transaction mechanism to provide data and message integrity

Transaction mechanism combined with process-pairs to ease exception handling and tolerate software faults.

the key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module.

The rest of the paper mostly focuses on defining these terms and how to make it work.

Fail-fast software modules and Heisenbugs

The idea is always to fail ASAP:

[Fail-fast modules] check all their inputs, intermediate results, outputs and data structures as a matter of course. If any error is detected, they signal a failure and stop. In the terminology of [Cristian], fail-fast software has small fault detection latency. The process achieves fault containment by sharing no state with other processes; rather, its only contact with other processes is via messages carried by a kernel message system.

This is related to Heisenbugs (transient bugs solved by trying again) as follows:

most production software faults are soft. If the program state is reinitialized and the failed operation retried, the operation will usually not fail the second time.

The assumption is that most hard bugs are weeded out rather early, and that residual bugs are often circumstantial (hardware, limit conditions [no space left], race conditions, etc.):

In these cases, resetting the program to a quiescent state and reexecuting it will quite likely work, because now the environment is slightly different. After all, it worked a minute ago! […] The assertion that most production software bugs are soft Heisenbugs that go away when you look at them is well known to systems programmers. Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring. But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear. This is analogous to the Heisenberg Uncertainty Principle in Physics.

In fact, the paper reports that only one out of 132 bugs in a given period of time was not a Heisenbug.

Hence, low-latency fault-detection, failing fast, and trying again from a clean state is often enough to fix production issues.
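The retry loop this implies is simple: on failure, throw away the state and re-execute from a clean slate. A sketch, with a simulated Heisenbug standing in for a real transient fault:

```python
def run_with_retry(operation, fresh_state, attempts=3):
    """Fail fast, reset to a clean state, and try again.
    Works precisely because Heisenbugs rarely recur on retry."""
    for attempt in range(attempts):
        state = fresh_state()      # reinitialize: no corrupted leftovers
        try:
            return operation(state)
        except RuntimeError:
            continue               # soft fault: environment will differ next time
    raise RuntimeError("hard (Bohr) bug: failed on every retry")

# Simulated Heisenbug: fails the first time, then "goes away".
calls = {"n": 0}
def flaky(state):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient fault")
    return state + ["done"]

print(run_with_retry(flaky, fresh_state=list))  # ['done']
```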

Process Pairs for fault-tolerant execution

Restarting alone is usually not enough, unless you manage to lower MTTR to milliseconds. The paper recommends using process pairs to do so:

configuring extra software modules gives a MTTR of milliseconds in case a process fails due to hardware failure or a software Heisenbug. If modules have a MTBF of a year, then dual processes give very acceptable MTBF for the pair. Process triples do not improve MTBF because other parts of the system (e.g., operators) have orders of magnitude worse MTBF.
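The arithmetic behind “very acceptable MTBF for the pair” is worth spelling out. Assuming independent failures, the pair only fails if the second module dies during the first one’s takeover window, which gives roughly MTBF² / (2 × MTTR). A sketch with the paper’s figures (one-year module MTBF, seconds-level takeover):

```python
def pair_mtbf_years(module_mtbf_years, takeover_seconds):
    """Approximate MTBF of a process pair, assuming independent failures:
    the pair is down only if the spare fails during the takeover window."""
    seconds_per_year = 365 * 24 * 3600
    mtbf_s = module_mtbf_years * seconds_per_year
    return (mtbf_s ** 2) / (2 * takeover_seconds) / seconds_per_year

# One-year module MTBF, 10-second takeover:
print(f"{pair_mtbf_years(1, 10):,.0f} years")  # 1,576,800 years
```

Over a million years of apparent MTBF from two one-year modules, which is why triples buy nothing: the operators fail far more often than that.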

The paper mentions multiple approaches:

Lockstep: run the same instructions on two processes on two different processors — this doesn’t work for Heisenbugs, only hardware failures

State Checkpointing: a master and a backup process exist. The master sends safe state checkpoints to the backup. If the master fails, the backup takes over. The author mentions this being very efficient, but difficult to program. The backups are physical — duplication of messages and whatnot.

Automatic Checkpointing: same as before, but the kernel does the checkpointing automatically. The problem is that it has high execution costs.

Delta checkpointing: logical checkpoints are sent to the backup process. The back-up is logical, decoupled, and tends to send less data around. Less likely to get a corrupted pair.

Persistence: Only one process runs at a time, but the state is persisted at logical points. In case of failure, the backup is brought back blank, loads the persisted state, and runs from there. It tends to be very lightweight and simple to program compared to other ways. They do tend to lose some transient state in case of failure, though.
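The persistence approach can be sketched as a primary that writes its state to stable storage at logical points; after a crash, a blank backup reloads the last checkpoint. Note how any transient state since the last checkpoint is lost (class and file names are illustrative):

```python
import json, os, tempfile

class PersistentProcess:
    """Persistence-style process pair: one process runs at a time,
    checkpointing its state to stable storage at logical points."""
    def __init__(self, store_path):
        self.store_path = store_path
        self.state = {"count": 0}

    def checkpoint(self):
        """Persist the state at a logical point."""
        with open(self.store_path, "w") as f:
            json.dump(self.state, f)

    def recover(self):
        """A blank backup loads the persisted state and runs from there."""
        with open(self.store_path) as f:
            self.state = json.load(f)

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
primary = PersistentProcess(path)
primary.state["count"] = 41
primary.checkpoint()
primary.state["count"] = 42        # transient state, never checkpointed

backup = PersistentProcess(path)   # primary crashes; blank backup starts
backup.recover()
print(backup.state["count"])       # 41: work since the checkpoint is lost
```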

The author argues that persistence with transactions is the best solution.

Defining Transactions

A transaction is a group of operations, be they database updates, messages, or external actions of the computer, which form a consistent transformation of the state.

Transactions should have the ACID properties: Atomicity, Consistency, Isolation, and Durability.


Transactions relieve the application programmer of handling many error conditions. If things get too complicated, the programmer (or the system) calls AbortTransaction which cleans up the state by resetting everything back to the beginning of the transaction.
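A toy sketch of that abort path: updates within a transaction are applied tentatively, and the abort resets everything back to the state at the start of the transaction. This is a deliberate simplification (real systems use write-ahead logging, not a full state copy):

```python
import copy

class MiniTransaction:
    """Toy transaction: abort() resets state to where begin() left it."""
    def __init__(self, state):
        self.state = state
        self.saved = None

    def begin(self):
        self.saved = copy.deepcopy(self.state)   # remember the clean state

    def abort(self):
        self.state.clear()
        self.state.update(self.saved)            # undo every update since begin()

    def commit(self):
        self.saved = None                        # updates become permanent

account = {"balance": 100}
tx = MiniTransaction(account)
tx.begin()
account["balance"] -= 250                        # oops: overdraft
if account["balance"] < 0:
    tx.abort()                                   # things got too complicated: undo it all
print(account["balance"])  # 100
```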

How to make it work

Transactions do not directly provide high system availability. If hardware fails or if there is a software fault, most transaction processing systems stop and go through a system restart


The “easy” process-pairs, persistent process-pairs, have amnesia when the primary fails and the backup takes over. Persistent process-pairs leave the network and the database in an unknown state when the backup takes over.

Putting the two together, however, may be the solution to the problem:

we can simply abort all uncommitted transactions associated with a failed persistent process and then restart these transactions from their input messages. This cleans up the database and system states, resetting them to the point at which the transaction began.

This fixes the issues of persistent process pairs while keeping their resilience against Heisenbugs and hardware failures.
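With transactions available, takeover becomes mechanical: the backup aborts every transaction the failed primary left uncommitted, then restarts each one from its input message. A sketch (the data layout here is made up for illustration):

```python
def take_over(transactions):
    """On takeover, abort whatever the failed primary left in flight,
    then re-run each aborted transaction from its input message."""
    to_retry = []
    for tx in transactions:
        if tx["status"] == "uncommitted":
            tx["status"] = "aborted"         # database reset to the tx start
            to_retry.append(tx["input"])     # replay from the input message
    return to_retry

in_flight = [
    {"input": "debit 5",  "status": "committed"},
    {"input": "credit 7", "status": "uncommitted"},
]
print(take_over(in_flight))  # ['credit 7']
```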

Other items

Communication is shaky. Sessions and alternative paths are a good idea. This (to me) sounds a lot like using TCP/IP to transmit data.

Storage should be kept on different disks, possibly on different media, and ideally in different physical locations. “this will protect against 75% of the failures (all the non-software failures). Since it also gives excellent protection against Heisenbugs, remote replication guards against most software faults.”

Allow atomic modifications to data across locations, but also provide geographical partitioning of data for better failure tolerance. This is a principle seen, if I recall, in Dynamo-like databases and ‘NewSQL’ databases such as VoltDB.


That’s about it for the practices and highlights from the paper. Keep hardware modular and redundant, limit how much interaction is required from operators, use processes with isolated memory, pair them up with persistent state that works with transactions, and you should be on your way.


from lizard's ghost
