Reliability, availability and serviceability

Reliability, availability and serviceability (RAS), also known as reliability, availability, and maintainability (RAM), is a computer hardware engineering term involving reliability engineering, high availability, and serviceability design. The phrase was originally used by International Business Machines (IBM) as a term to describe the robustness of their mainframe computers.
Computers designed with higher levels of RAS have many features that protect data integrity and help them stay available for long periods of time without failure. Data integrity and uptime are particular selling points for mainframes and fault-tolerant systems.

Definitions

While RAS originated as a hardware-oriented term, systems thinking has extended the concept of reliability-availability-serviceability to systems in general, including software.

Reliability can be defined as the probability that a system will produce correct outputs up to some given time t. Reliability is enhanced by features that help to avoid, detect and repair hardware faults. A reliable system does not silently continue and deliver results that include uncorrected corrupted data. Instead, it detects and, if possible, corrects the corruption, for example by retrying an operation for transient or intermittent errors. For uncorrectable errors, it isolates the fault and reports it to higher-level recovery mechanisms, or else halts the affected program or the entire system and reports the corruption. Reliability can be characterized in terms of mean time between failures (MTBF), with reliability = exp(-t/MTBF).
Availability means the probability that a system is operational at a given time, i.e. the amount of time a device is actually operating as the percentage of total time it should be operating. High-availability systems may report availability in terms of minutes or hours of downtime per year. Availability features allow the system to stay operational even when faults do occur. A highly available system would disable the malfunctioning portion and continue operating at a reduced capacity. In contrast, a less capable system might crash and become totally nonoperational. Availability is typically given as a percentage of the time a system is expected to be available, e.g., 99.999 percent.
Serviceability or maintainability is the simplicity and speed with which a system can be repaired or maintained; if the time to repair a failed system increases, then availability will decrease. Serviceability includes various methods of easily diagnosing the system when problems arise. Early detection of faults can decrease or avoid system downtime. For example, some enterprise systems can automatically call a service center when the system experiences a system fault. The traditional focus has been on making the correct repairs with as little disruption to normal operations as possible.
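
The three definitions above can be made concrete with a short calculation. The following Python sketch assumes a constant failure rate, which is the exponential model behind the MTBF formula. It also uses the common steady-state relation availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. That relation is an assumption brought in for illustration and is not stated in the text above. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """Probability of correct operation up to time t, assuming a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state fraction of time the system is operational (assumed model)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Convert an availability fraction into expected minutes of downtime per year."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60
```

Under these assumptions, downtime_minutes_per_year(0.99999) comes to about 5.26 minutes per year. Increasing the repair time passed to availability() lowers the result, which is the link between serviceability and availability described above.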

Note the distinction between reliability and availability: reliability measures the ability of a system to function correctly, including avoiding data corruption, whereas availability measures how often the system is available for use, even though it may not be functioning correctly. For example, a server may run forever and so have ideal availability, but may be unreliable, with frequent data corruption.

Failure types

Physical faults can be temporary or permanent.

Permanent faults lead to a continuing error and are typically due to some physical failure such as metal electromigration or dielectric breakdown.
Temporary faults include transient and intermittent faults.
* Transient faults lead to independent one-time errors and are not due to permanent hardware faults: examples include alpha particles flipping a memory bit, electromagnetic noise, or power-supply fluctuations.
* Intermittent faults occur due to a weak system component, e.g. circuit parameters degrading, leading to errors that are likely to recur.

Failure responses

Transient and intermittent faults can typically be handled by detection and correction, e.g. by error-correcting codes (ECC) or instruction replay. Permanent faults lead to uncorrectable errors, which can be handled by switching to duplicate hardware, e.g. processor sparing, or by passing the uncorrectable error to higher-level recovery mechanisms. A successfully corrected intermittent fault can also be reported to the operating system to provide information for predictive failure analysis.
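
The response policy described above (retry first, then escalate, and report corrected errors) can be sketched in software terms. This is an illustration only, not how any particular processor or operating system implements it. DetectedError, UncorrectableError, run_with_retry and the on_corrected callback are all hypothetical names.

```python
class DetectedError(Exception):
    """Hypothetical signal that an operation detected a fault (e.g. a failed check)."""


class UncorrectableError(Exception):
    """Raised when retries are exhausted; stands in for escalation to higher-level recovery."""


def run_with_retry(operation, max_retries=3, on_corrected=None):
    """Run operation(), retrying when a fault is detected.

    A transient fault should disappear on retry. If the fault persists through
    every retry it is treated as uncorrectable and escalated. When a retry
    succeeds, on_corrected (if given) receives the number of retries needed,
    so the event can be logged for predictive failure analysis.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            result = operation()
        except DetectedError as exc:
            last_error = exc
            continue  # try the operation again
        if attempt > 0 and on_corrected is not None:
            on_corrected(attempt)
        return result
    raise UncorrectableError(f"still failing after {max_retries} retries") from last_error
```

A caller might pass a list's append method as on_corrected to collect corrected-error counts. A rising count for one component is the kind of signal that predictive failure analysis looks for.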

Hardware features

Example hardware features for improving RAS include the following, listed by subsystem:

Processor:
* Processor instruction error detection with instruction retry, e.g. alternative processor recovery in IBM mainframes, or "Instruction replay technology" in Itanium systems.
* Processors running in lock-step to perform master-checker or voting schemes.
* Machine check architecture to report errors to the OS.
Memory:
* Parity or ECC protection of memory components, and memory bus; bad cache line disabling; memory scrubbing; memory sparing; bad page offlining; redundant bit steering; redundant array of independent memory.
I/O:
* Cyclic redundancy check checksums for data transmission/retry and data storage, e.g. PCI Express Advanced Error Reporting, redundant I/O paths.
Storage:
* RAID configurations for magnetic disk storage.
* Journaling file systems for file repair after crashes.
* Checksums on both data and metadata, and background scrubbing.
Power/cooling:
* Duplicating components to avoid single points of failure, e.g., power-supplies.
* Over-designing the system for the specified operating ranges of clock frequency, temperature, voltage, vibration.
* Temperature sensors to throttle operating frequency when temperature goes out of specification.
* Surge protector, uninterruptible power supply, auxiliary power.
System:
* Hot swapping of components: processors, memories.
* Predictive failure analysis to predict which intermittent correctable errors will lead eventually to hard non-correctable errors.
* Partitioning/domaining of computer components to allow one large system to act as several smaller systems.
* Virtual machines to decrease the severity of operating system software faults.
* Redundant I/O domains or I/O partitions for providing virtual I/O to guest virtual machines.
* Computer clustering capability with failover capability, for complete redundancy of hardware and software.
* Dynamic software updating to avoid the need to reboot the system for a kernel software update, for example Ksplice under Linux.
* Independent service processor for serviceability: remote monitoring, alerting and control.
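
To illustrate how ECC protection in the list above can correct a single flipped bit, here is a minimal Python sketch of a Hamming(7,4) code. It is a textbook code chosen for brevity. Memory ECC in real systems typically uses wider codes that also detect double-bit errors, so this is not a description of any specific product. The function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, error_position); error_position is 0 when no error is detected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4  # equals the 1-based position of a single flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1  # correct the flipped bit
    return [c[2], c[4], c[5], c[6]], syndrome
```

The non-zero syndrome both corrects the data and identifies which bit failed, which is the information a system would log as a corrected error. This small code can only handle one flipped bit per codeword: two flipped bits would be miscorrected.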

Fault-tolerant designs extended the idea by making RAS the defining feature of their computers for applications like stock market exchanges or air traffic control, where system crashes would be catastrophic. Fault-tolerant computers, which tend to have duplicate components running in lock-step for reliability, have become less popular due to their high cost. High-availability systems, using distributed computing techniques like computer clusters, are often used as cheaper alternatives.
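
The lock-step approach with a voting scheme can be sketched as a majority vote over the outputs of replicated components. The Python below is a simplified software analogy, assuming each replica's output is a hashable value. Real lock-step hardware compares signals cycle by cycle, and majority_vote is a hypothetical name.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, faulty_indices) for the value produced by a strict majority of replicas.

    faulty_indices lists the replicas that disagreed with the majority, so they
    can be reported or taken out of service. Raises RuntimeError when no value
    has a strict majority, since the fault then cannot be masked.
    """
    if not outputs:
        raise ValueError("at least one replica output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty
```

With three replicas, one faulty output is outvoted and identified. With only two replicas (a master-checker pair), a disagreement can be detected but not resolved, which this sketch reports as an error.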
