All enterprise systems are expected to be reliable, but the definition of this term can vary significantly depending on the use case, business criticality, and integrations that define a system. Reliability relates to other terms such as business continuity, failure management, resilience, high availability, stability, uptime, fault tolerance, and redundancy, but this pillar is titled reliability to capture the wide spectrum of definitions that organizations have. The most important recommendation in this section is to treat the definition of a reliable system as highly subjective. Each organization has different workflows, data, and expectations, so the process of defining and achieving reliability will be unique to each organization.
Well-architected systems are designed to reliably provide the capabilities, functionality and workflows that business requirements demand. This does not mean that every system must be highly available and highly redundant to be well-architected, because for some systems this would indicate an improper level of complexity or cost that has a negative overall impact on the system.
Reliability is often measured against a service-level agreement (SLA). An SLA is a commitment to the business users and clients of a system to maintain an agreed level of service. It is usually expressed through specific performance goals, such as “all requests must respond in under three seconds”, or through uptime and “availability” levels. An availability SLA defines:
SLAs are a common requirement for enterprise systems, but well-architected systems focus on properly defining, monitoring, and responding to SLAs rather than simply setting a higher number as the goal. Working out these details is a complex process, which shows how much planning and design needs to go into discussions about reliability.
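To make an availability target concrete, it can help to translate the percentage into the amount of downtime it actually permits over a measurement period. The following is a minimal sketch of that arithmetic, not a prescribed method. The function names `allowed_downtime` and `meets_target` are hypothetical, and the sketch assumes that availability is measured as uptime divided by total time in the period, which is only one of several possible definitions.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the maximum downtime that still meets the availability target.

    Assumes availability = uptime / total time in the period (an assumption
    of this sketch; an organization may define it differently).
    """
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


def meets_target(observed_downtime: timedelta,
                 availability_pct: float,
                 period: timedelta) -> bool:
    """Check whether the observed downtime stays within the downtime budget."""
    return observed_downtime <= allowed_downtime(availability_pct, period)


if __name__ == "__main__":
    month = timedelta(days=30)
    # Compare how much downtime different targets allow over the same period.
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target, month)}")
```

Under these assumptions, a 99.9% target over a 30-day period leaves a budget of roughly 43 minutes of downtime, while 99% leaves a little over 7 hours. Seeing the budget in minutes rather than as a percentage makes it easier to judge whether a higher number is worth the added complexity and cost.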
When systems are designed for a certain level of reliability, they may use a combination of several techniques to achieve it, including:
Each of these techniques is discussed in more detail within this section, with an overview of options, recommendations, and approaches.