Best practices for reliability

Identifier Best practice
R.1 Target a reasonable level of reliability: more is not always better. Configuring an internal-facing system for high availability can sound like a good idea, but if the operational burden of maintaining a larger infrastructure distracts from the organization’s mission, then the increased reliability comes at too high a cost. Carefully establish recovery time expectations with stakeholders.
R.2 Take regular system backups using supported tools, and create a plan for testing the restoration of backups on a regular basis. Backups that are never tested create the risk of a failed system restore when one is actually needed.
R.3 Understand the weakest links in your system reliability strategy, which could be technical, personnel or process gaps. System uptime and SLA guarantees are limited by their weakest support system or component.
R.4 Use lower environments to mirror production configurations and to test reliability approaches like high availability and backup processes.
R.5 Define escalation paths to ensure that issues reach the right staff quickly and that action can be taken to resolve them.
R.6 Understand user workflows - while a service may report that it is working based on a simple health check, if a user’s workflow is not successful, they will often see it as a system outage. Understanding real user workflows can help to quickly narrow down the problematic service or component and fix the problem users are seeing.
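
The weakest-link point in R.3 can be illustrated with a little arithmetic. The following is a minimal sketch, not a sizing tool: it assumes every component must be up for the system to be up and that components fail independently, in which case the overall availability is the product of the individual availabilities and can never exceed that of the weakest component. The component names and availability figures are hypothetical and chosen only for illustration.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def composite_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities.values())


def expected_downtime_minutes(availability, period_minutes=MINUTES_PER_YEAR):
    """Expected downtime over a period for a given availability fraction."""
    return (1.0 - availability) * period_minutes


# Hypothetical components and availability targets (illustrative values only).
components = {
    "load_balancer": 0.9999,
    "application": 0.999,
    "database": 0.9995,
    "shared_storage": 0.99,
}

overall = composite_availability(components)
weakest = min(components, key=components.get)

print(f"Weakest component: {weakest} ({components[weakest]:.4%})")
print(f"Overall availability: {overall:.4%}")
print(f"Expected downtime per year: {expected_downtime_minutes(overall):.0f} minutes")
```

Under these assumptions, improving any component other than the weakest one barely moves the overall figure, which is why it helps to find the weakest link before investing elsewhere. People and process gaps are harder to quantify, but they limit the result in the same way.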