For enterprise systems with availability expectations, requirements, or commitments, a clearly defined, actionable, and well-tested backup and disaster recovery (DR) approach is critical. Designing a backup strategy or a disaster recovery approach requires that organizations first understand the extent and dependencies of their systems. They can then establish the goals for recovery, take stock of the available IT resources, and consider the process of recovery, from triggering a backup or restore to the feasibility of staff support and user impacts.
The importance or role of backups primarily relates to the business criticality of a system: what workflows it supports and what categories of data are stored there. Some systems, like a development server or one that is used by a small team, may not need a backup strategy, while other systems may use backup and DR approaches primarily to test the processes and inform their successful application to a production system.
This topic provides an overview of the backup and disaster recovery process, including implications for ArcGIS software components and available options to support a backup or disaster recovery approach for various ArcGIS applications.
The process of backing up a system, piece of data, or hardware component has always been important to IT systems. Historically, backups were generally implemented at a storage level. The proliferation of cloud services has introduced several new considerations, chiefly the need to back up cloud-hosted data and SaaS systems, along with new locations and providers that can be used to store backups at a data level.
Considerations for backups include a careful definition of the following:
In addition, the method used to restore a system from a backup is important, as is the impact that a restore has on the system's users.
Historically, only the most critical business systems or data were reliably backed up, so the scope of backups was focused on the most important systems, servers, or databases. With the modern cost of storage significantly lower than in the past, the scope of backups has expanded dramatically, trending towards entire-system backups rather than selectively backing up one application component or data type.
The methods used to create backups can vary widely, from application-specific backup formats like a binary database dump, to virtual machine images, to backups of configurations or composite backups that might combine the state of an application with data and the application code itself. The method used to create backups can have a significant impact on the time it takes to create a backup, the frequency with which it can be run, and how it is restored. Backup methods also differ in how well they capture the state of a running system. There are three main types of backups to consider in ArcGIS systems:
The frequency with which a system is backed up and the initiation time at which those backups occur are also both important considerations. Backing up too frequently can lead to excess cost or increased system interruptions, while backing up too infrequently might mean unacceptable data loss between an outage event and the most recent backup. Most backups are designed to avoid interrupting users at some level, but the timing of a backup can impact what data and workflows are captured in the backup. Usually backups are taken during the “off hours” of a system (if those exist) so as to capture most of the previous day’s data and information, rather than in the middle of a working day where the backup might only capture part of a longer process or larger dataset that is actively being worked on.
The reason why a backup process exists may seem self-evident, for example, to restore the state of a system, but backups are additionally used for a variety of nuanced use cases. Some backups are tied to system upgrades, so that if an issue with the upgrade is discovered, the system can be brought back to a known state. The same consideration drives backup processes tied to a disruptive system change like the removal of a server or the addition of a new component. Some backups are created only for the worst-case scenario, so that when that disaster occurs, they can be used to recover from it. Others are used to snapshot the state of an environment and create a new, lower environment as a copy, or to promote content in the other direction from a lower environment towards a higher, production-level system.
Disaster recovery or DR is a topic that is often closely tied to a backup and restore process, but they are not synonymous. DR specifically refers to “what do we do to recover after a disaster,” the process that recovery would require, and the definition of what both “disaster” and “recover” mean in an organization-specific context. Disasters usually refer to significant IT system outages, whether caused by hardware, environmental or user configuration issues, which lead to a significant interruption in a system’s uptime, availability, or functionality.
One of the first steps in establishing a DR process is defining what constitutes a “disaster” that necessitates a disaster recovery process. Defining this carefully is important: too sensitive a definition (such as one failed request to a primary system) can lead to frequent DR failover actions, while waiting until an outage has exceeded four hours in duration might be too permissive and result in extensive user interruption. In the end, defining “what constitutes a disaster” so that a system can take action to initiate either a recovery or a failover involves both automated and human-managed decision making, and will rely on organizational definitions of concepts like availability, criticality, and uptime. Common examples of a DR “initiator” would be loss of access to a data center, storage failure, VM hypervisor failures, a DNS outage or network connectivity problem, and even data corruption or a ransomware attack on a system.
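The trade-off between an overly sensitive and an overly permissive definition can be illustrated with a minimal sketch. It assumes a hypothetical health check that reports success or failure, and a hypothetical 15-minute threshold; neither value comes from ArcGIS software, and a real trigger would normally combine automated signals like this with human review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class OutageDetector:
    """Flags a DR-level outage only after failures persist for `threshold`."""

    threshold: timedelta                      # hypothetical organizational value
    first_failure: Optional[datetime] = None  # start of the current failure streak

    def record(self, healthy: bool, at: datetime) -> bool:
        """Record one health-check result; return True once DR review should start."""
        if healthy:
            self.first_failure = None  # any successful check resets the streak
            return False
        if self.first_failure is None:
            self.first_failure = at
        return at - self.first_failure >= self.threshold


detector = OutageDetector(threshold=timedelta(minutes=15))
start = datetime(2024, 1, 1, 9, 0)
results = [
    detector.record(False, start),                          # one failed request
    detector.record(False, start + timedelta(minutes=10)),  # still under threshold
    detector.record(False, start + timedelta(minutes=15)),  # threshold reached
]
print(results)
```

A single failed check does not trigger anything here, which avoids the "one failed request" problem, while a shorter or longer threshold shifts the balance between unnecessary failovers and extended user interruption.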
Two other key definitions in the DR process are the organization’s Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO answers the question “how much data do we lose in a disaster scenario?” and relates to the type of backups that are created and the frequency with which data or configurations are backed up. An RPO of four hours would mean that, at worst, four hours of data or work is lost when a DR-level outage occurs, and that the organization is comfortable with potentially losing those four hours of data or inputs in a DR process. While a low RPO seems like an obvious goal, in most systems the complexity of backups and their potential to interrupt users mean that a very low RPO carries too high a cost in resources, storage, or user impact.
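As a rough worked example of how backup frequency relates to RPO, the sketch below assumes that a backup captures the system state at the moment it starts and only becomes usable once it finishes. Under that assumption, the worst-case data loss is the interval between backups plus the time a backup takes to complete. The function names and the sample durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst case: the outage occurs just before the next backup finishes."""
    return interval + backup_duration


def meets_rpo(interval: timedelta, backup_duration: timedelta, rpo: timedelta) -> bool:
    """True if the backup schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(interval, backup_duration) <= rpo


rpo = timedelta(hours=4)

# Nightly backups cannot satisfy a four-hour RPO.
nightly_ok = meets_rpo(timedelta(hours=24), timedelta(minutes=30), rpo)

# Backups every three hours that take 30 minutes leave 3.5 hours at risk.
three_hourly_ok = meets_rpo(timedelta(hours=3), timedelta(minutes=30), rpo)

print(nightly_ok, three_hourly_ok)
```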
RTO answers the question “how long until we are back up again?” and refers to the time taken to restore backups to a DR system, create a new system from scratch, or automatically deploy resources to recover from a disaster. RTO is often lower than RPO, as a failover system, or components of one, might already exist and the ability to switch to that system only requires a DNS or networking change. Defining and achieving a realistic RPO and RTO are both important parts of building a DR strategy, and of getting agreement from stakeholders as to the cost and workflow implications of implementing this strategy.
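One way to test whether an RTO is realistic is to add up the sequential steps a recovery would involve. The sketch below compares switching to an existing failover system with rebuilding from backups. Every step name and duration is a hypothetical placeholder; real values should come from timed recovery tests.

```python
from datetime import timedelta
from typing import Mapping


def estimated_recovery_time(steps: Mapping[str, timedelta]) -> timedelta:
    """Sum sequential recovery steps into one end-to-end estimate."""
    return sum(steps.values(), timedelta())


# Hypothetical durations for switching to an existing failover system.
failover_steps = {
    "detect and decide": timedelta(minutes=30),
    "DNS or networking change": timedelta(minutes=15),
    "validate system": timedelta(minutes=15),
}

# Hypothetical durations for rebuilding a system from backups.
rebuild_steps = {
    "detect and decide": timedelta(minutes=30),
    "provision machines": timedelta(hours=2),
    "install software": timedelta(hours=1),
    "restore backup": timedelta(hours=2),
    "validate system": timedelta(minutes=30),
}

rto = timedelta(hours=4)
for name, steps in (("failover", failover_steps), ("rebuild", rebuild_steps)):
    total = estimated_recovery_time(steps)
    print(name, total, "meets RTO" if total <= rto else "exceeds RTO")
```

With these placeholder numbers, the failover path fits comfortably inside a four-hour RTO and the rebuild path does not, which is the kind of result that informs the cost conversation with stakeholders.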
One common approach to responding to a disaster scenario is to switch traffic to a failover system, which exists to take on traffic when a primary system identifies or experiences issues. This approach has several important requirements that have cost and workflow implications:
Another DR approach is to use backups to rebuild a system, which might result in a longer RTO, but reduces ongoing cost as a failover system is not maintained at all times. This may involve taking VM backups or an application-level backup, and then rebuilding the system, either automatically through infrastructure-as-code and software deployment automation, or manually with an experienced team, before the backup is restored, traffic is switched back to the system and users can resume workloads in the system.
DR processes can also have broader implications or approaches that are not unique to the enterprise system built with ArcGIS. For example, an entire data center may have a DR strategy that uses VM replication to maintain a set of VMs in a separate data center, and when an outage with a storage or hypervisor component is identified, all users might switch to access the secondary data center.
DR processes also need to consider what happens after a recovery. If there is a primary system or data center, and a DR event leads to a failover to a secondary system, is the standard approach to switch traffic back to the primary system after the outage is resolved? Or does the standby system now become the primary, with the original system set up as a standby to support the next DR situation? Both scenarios have value, and defining what happens after the incident is resolved is an important part of completing the DR strategy and approach.
In ArcGIS systems, there are several approaches to initiating a backup process that are built into the ArcGIS software. Single-component backups, which only back up the contents and state of a single application, are available for ArcGIS Enterprise components including Portal for ArcGIS, ArcGIS Server, and the ArcGIS Data Store. These component-specific backups are not primarily intended for DR purposes; they are intended for backing up and restoring in place. For DR purposes, the WebGISDR tool is recommended. Each component has a different backup approach, including:
Several ArcGIS Server roles are generally stateless, where all relevant configurations are stored as portal items and generally do not need to be backed up. This includes Notebook Server, GeoAnalytics Server, Business Analyst Enterprise Server (GeoEnrichment Server), and Raster Analytics Server. Each of these components can generally be re-created from a blank machine or installed image instead of restoring the specific deployment, as the content is stored in ArcGIS Enterprise. The only consideration beyond this would be scenarios where custom code or third-party libraries have been deployed to the system to enable a Python-based workflow or other process.
Beyond component-specific backup functionality, ArcGIS Enterprise also has a multi-component backup tool called the Web GIS Disaster Recovery tool (WebGISDR for short) which can back up the state of several different ArcGIS Enterprise components at the same time.
The WebGISDR tool is described fully in the ArcGIS Enterprise documentation, but it is important to note that this tool prioritizes collective application consistency and state over speed of backup or flexibility of the process. In practice, this means that the tool attempts to capture the state of the system exactly at the time when the request was run. If further content or data is published while the backup is occurring, it will not appear in any of the backup files and will not be restorable to a failover system. This focus is comparable to a focus on RPO instead of RTO: it prioritizes the functional completeness of the restored system over how quickly the system can be restored, because a faster but less consistent restore could lead to misconfigurations or bad data.
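As an illustration of how a scheduled job might call the tool, the following sketch builds a WebGISDR command line from Python. It assumes the `--export`, `--import`, and `--file` options described in the ArcGIS Enterprise documentation; the tool path and properties file name are hypothetical placeholders, and the options should be verified against the documentation for the installed version before use.

```python
import subprocess
from typing import List


def build_webgisdr_command(tool_path: str, properties_file: str,
                           mode: str = "export") -> List[str]:
    """Build the argument list for one WebGISDR run.

    `mode` is "export" (create a backup) or "import" (restore one).
    """
    if mode not in ("export", "import"):
        raise ValueError(f"unsupported mode: {mode!r}")
    return [tool_path, f"--{mode}", "--file", properties_file]


def run_webgisdr(tool_path: str, properties_file: str, mode: str = "export") -> int:
    """Run the tool and return its exit code (requires an ArcGIS Enterprise install)."""
    command = build_webgisdr_command(tool_path, properties_file, mode)
    return subprocess.run(command, check=False).returncode


# Hypothetical paths; only the command is built here, nothing is executed.
command = build_webgisdr_command("webgisdr", "webgisdr.properties")
print(" ".join(command))
```

Separating command construction from execution keeps the sketch easy to test on a machine without ArcGIS Enterprise, and makes it simple to log the exact command that a scheduled backup ran.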
Beyond ArcGIS application-specific backups, there are a variety of other methods that may be suggested or preferred by an organization, which may be native to a specific component or an infrastructure provider. These methods may be suitable for an ArcGIS backup or disaster recovery process but should be carefully assessed and considered as their use can also be disruptive to a system.
For example, VM snapshots are a common approach to backing up a system but can introduce complicated challenges if the snapshot method includes a sudden machine restart or captures data state only. In those cases, in-process operations or configurations might not be captured, or might be only partially captured, which could lead to an unexpected or corrupted state when a restore or recovery occurs.
VM-based backup strategies sometimes move VM resources between two data centers to prevent an outage. In these scenarios, ensure that the ArcGIS Server and ArcGIS Pro hosts are accessing a database in their own data center, not making requests to the original data center, as this will introduce latency that will negatively affect user experience.
Cloud-based backup and DR tools, such as Microsoft Azure Site Recovery, can be compatible with ArcGIS Enterprise systems when they are carefully planned so that DNS resolution, database connectivity, and client connectivity to the system are all maintained in the case of a site recovery operation. These backup approaches operate at a relatively low, virtual machine level of access, so they do not provide guarantees of application consistency. This means that while the recovery system will often be successfully restored and operated, in some cases application-level inconsistencies could occur, for example from a publishing process that is underway or an edit to a feature service that is made during the backup process. There are ways to plan for this, such as taking VM snapshots during off-peak periods, but in general the guidance to “plan, test and carefully consider implications” applies to these external tool approaches.
Database providers that build the relational databases that ArcGIS works with (either as an enterprise geodatabase or the source of a query layer) provide their own database-specific backup options, which usually can create a file-based backup of the database contents and configurations for restoration to a new system or deployment when needed.
ArcGIS Online, as a SaaS offering and system, handles backup and recovery for disaster scenarios entirely without user input, and the Service Level Agreement for ArcGIS Online reflects this commitment. Users may choose to make their own, additional backups of content or configurations in ArcGIS Online through a variety of patterns, including: