Reliability pillar

All enterprise systems are expected to be reliable, but the definition of this term can vary significantly depending on the use case, business criticality, and integrations that define a system. Though reliability relates to other terms such as business continuity, failure management, resilience, high availability, stability, uptime, fault tolerance or redundancy, this pillar is titled reliability to capture the wide spectrum of definitions that organizations have. The most important recommendation in this section is to consider the definition of a reliable system to be highly subjective. Each organization is different, has different workflows, data, and expectations, and the process of defining and achieving reliability will be unique to each organization.

Well-architected systems are designed to reliably provide the capabilities, functionality and workflows that business requirements demand. This does not mean that every system must be highly available and highly redundant to be well-architected, because for some systems this would indicate an improper level of complexity or cost that has a negative overall impact on the system.

Reliability is often measured against a Service-level Agreement (SLA). An SLA is a commitment to the business users and clients of a system to maintain a level of service at all times. An SLA is usually expressed either through specific performance goals, such as “all requests must respond in under three seconds”, or through uptime and availability levels. An availability SLA defines:

  • A set of services, applications, or workflows
  • A time period during which they are expected to be available, usually 24 hours a day and 7 days a week
  • A definition of what conditions define a breach of the SLA, such as repeated failures, slow requests, errors or degraded user experiences
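As a minimal sketch of how the three parts of an availability SLA might be written down and checked, the Python below models the covered services, the time period, and a breach condition based on a downtime budget. The class name `AvailabilitySla`, its fields, and the service name `order-entry` are hypothetical and chosen only for illustration. A real SLA would likely include further breach conditions, such as slow requests or error rates.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Tuple


@dataclass(frozen=True)
class AvailabilitySla:
    """Hypothetical availability SLA; all names here are illustrative."""

    services: Tuple[str, ...]  # services, applications, or workflows covered
    period: timedelta          # window over which availability is measured
    target: float              # required availability as a fraction, e.g. 0.999

    def allowed_downtime(self) -> timedelta:
        """Downtime budget for the period before the target is missed."""
        return self.period * (1.0 - self.target)

    def is_breached(self, observed_downtime: timedelta) -> bool:
        """Assumed breach rule: observed downtime exceeds the budget."""
        return observed_downtime > self.allowed_downtime()


# Example: a 99.9% target over a 30-day period.
sla = AvailabilitySla(
    services=("order-entry",),
    period=timedelta(days=30),
    target=0.999,
)
print("Downtime budget:", sla.allowed_downtime())
print("Breached after 1 hour down:", sla.is_breached(timedelta(hours=1)))
```

Under these assumptions, a 99.9% target over 30 days leaves a budget of roughly 43 minutes of downtime, which shows how quickly a higher target narrows the margin for maintenance and failures.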

SLAs are a common requirement for enterprise systems, but well-architected systems focus on properly defining, monitoring, and responding to SLAs rather than simply setting a higher availability number as the goal. Defining these details is a complex process, which highlights the depth of planning and design that discussions about reliability require.

When systems are designed for a certain level of reliability, they may use a combination of several techniques to achieve it, including:

  1. High availability – The practice of having redundant software components, often geographically redundant, which provide the same services in case of an outage
  2. Backups – The practice of regularly backing up the state of a system, the important details, data, or services, for recovery in the future
  3. Disaster Recovery – A process of planning for uncommon but disruptive scenarios, and identifying methods to quickly rebuild a site or system if a scenario occurs
  4. Monitoring – Using effective monitoring to identify potential issues early, measure the amount of downtime a system experiences, and report those results for post-event summaries. This relates closely to the observability pillar
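To illustrate the monitoring item, the following sketch shows one way downtime could be measured from recorded outages and turned into an availability figure for a post-event summary. The outage records, the function names `total_downtime` and `measured_availability`, and the reporting window are all assumptions made for this example and do not reflect any particular monitoring tool.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Outage = Tuple[datetime, datetime]  # (start, end) of one recorded outage


def total_downtime(
    outages: Iterable[Outage], window_start: datetime, window_end: datetime
) -> timedelta:
    """Sum outage time inside the window, counting overlapping records once."""
    # Keep only outages that touch the window, clipped to its boundaries.
    clipped = sorted(
        (max(start, window_start), min(end, window_end))
        for start, end in outages
        if end > window_start and start < window_end
    )
    total = timedelta(0)
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # A new, separate outage begins: bank the previous one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlaps the current outage: extend it.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


def measured_availability(
    outages: Iterable[Outage], window_start: datetime, window_end: datetime
) -> float:
    """Fraction of the window during which the system was up."""
    window = window_end - window_start
    return 1.0 - total_downtime(outages, window_start, window_end) / window


# Example: two overlapping outage records in a 30-day window.
start, end = datetime(2024, 1, 1), datetime(2024, 1, 31)
records = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 11, 0)),
    (datetime(2024, 1, 5, 10, 30), datetime(2024, 1, 5, 11, 30)),
]
print("Downtime:", total_downtime(records, start, end))
print("Availability:", measured_availability(records, start, end))
```

Merging overlapping records matters when several monitors report the same incident; without it, the same outage would be counted more than once and the reported availability would be too low.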

Each of these techniques is discussed in more detail within this section, with an overview of options, recommendations, and approaches.
