Backups and disaster recovery

For enterprise systems with availability expectations, requirements, or commitments, a clearly-defined, actionable, and well-tested backup and disaster recovery (DR) approach is critical. Designing a backup strategy or a disaster recovery approach requires that organizations first understand the extent and dependencies of their systems, before carefully establishing the goals for recovery, understanding the available IT resources, and considering the process of recovery, from triggering a backup or restore to the feasibility of staff support and user impacts.

The importance or role of backups primarily relates to the business criticality of a system, what workflows it supports, and what categories of data are stored there. Some systems, like a development server or one that is used by a small team, may not need a backup strategy, while other systems may use backup and DR approaches primarily to test the processes and inform the successful application of them to a production system.

This topic provides an overview of the backup and disaster recovery process, including implications for ArcGIS software components and available options to support a backup or disaster recovery approach for various ArcGIS applications.

Understand backups

The process of backing up a system, piece of data, or hardware component has always been important to IT systems. While, historically, backups were generally implemented at a storage level, the proliferation of cloud services has introduced several new considerations related to backups, primarily the need to back up cloud-hosted data and SaaS systems, along with new locations and providers that can be used to store backups at a data level.

Considerations for backups include a careful definition of the following:

Scope of which systems or data are backed up
Method by which the backup is conducted, created, or managed
Location where the backups are stored, and implications related to that choice

In addition, the method used to restore a system from a backup is also important, as well as the impact it has on its users.

Historically, only the most critical business systems or data were reliably backed up, so the scope of backups was focused on the most important systems, servers, or databases. With the modern cost of storage significantly lower than in the past, the scope of backups has expanded dramatically, trending towards entire-system backups rather than selectively backing up one application component or data type.

Backup methods

The methods used to create backups can vary widely, from application-specific backup formats like a binary database dump to virtual machine images, to backups of configurations or composite backups that might combine the state of an application with data and the application code itself. The method used to create backups can have a significant impact on the time it takes to create a backup, the frequency with which it can be run, and how it is restored. Backups also may differ in how they are created, and how well that captures the state of a running system. There are three main types of backups to consider in ArcGIS systems:

VM or system-level backups are created through the hypervisor hosting a virtual machine, or through a cloud console as a VM snapshot. They back up all content on the system so that a new system can be created, or a current system can be “rolled back” to the backup when necessary.
Application-level backups are created from an application itself, such as an ArcGIS Server or Portal for ArcGIS backup, or the WebGISDR tool backup process. This could also refer to an export of a GeoEvent Server configuration, or application-level functionality that supports backups like the use of versioning and backups for documents in Microsoft OneDrive.
Data-level backups are used at the disk, file system, or database level to back up that component of a system. While key configurations might exist in an application, backing up the data is a common approach to ensuring that the valuable content (if stored in a database) is backed up and could be restored in the future.

The frequency with which a system is backed up and the initiation time at which those backups occur are also both important considerations. Backing up too frequently can lead to excess cost or increased system interruptions, while backing up too infrequently might mean unacceptable data loss between an outage event and the most recent backup. Most backups are designed to avoid interrupting users at some level, but the timing of a backup can impact what data and workflows are captured in the backup. Usually backups are taken during the “off hours” of a system (if those exist) so as to capture most of the previous day’s data and information, rather than in the middle of a working day where the backup might only capture part of a longer process or larger dataset that is actively being worked on.

The reason why a backup process exists may seem self-evident, for example, to restore the state of a system, but backups are additionally used for a variety of nuanced use cases. Some backups are tied to system upgrades, so that if an issue with the upgrade is discovered, the system can be brought back to a known state. The same consideration drives backup processes tied to a disruptive system change like the removal of a server or the addition of a new component. Some backups are created only for the worst-case scenario, so that when that disaster occurs, they can be used to recover from it. Others are used to snapshot the state of an environment and create a new, lower environment as a copy, or to promote content in the other direction from a lower environment towards a higher, production-level system.

Understand disaster recovery

Disaster recovery or DR is a topic that is often closely tied to a backup and restore process, but they are not synonymous. DR specifically refers to “what do we do to recover after a disaster,” the process that recovery would require, and the definition of what both “disaster” and “recover” mean in an organization-specific context. Disasters usually refer to significant IT system outages, whether caused by hardware, environmental or user configuration issues, which lead to a significant interruption in a system’s uptime, availability, or functionality.

One of the first steps in establishing a DR process is defining what constitutes a “disaster” that necessitates a disaster recovery process. Defining this carefully is important, as too sensitive a definition (such as one failed request to a primary system) can lead to frequent DR failover actions, but waiting until an outage has exceeded four hours in duration might be too permissive and result in extensive user interruption. In the end, defining “what constitutes a disaster” so that a system can take action to either initiate a recovery or a failover involves both automated and human-managed decision making, and will rely on organizational definitions of concepts like availability, criticality, and uptime. Common examples of a DR “initiator” would be loss of access to a data center, storage failure, VM hypervisor failures, a DNS outage or network connectivity problem, and even data corruption or a ransomware attack on a system.
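The trade-off between an overly sensitive and an overly permissive trigger can be sketched as a small detector that flags a disaster only after repeated failed health checks sustained over a minimum duration. The class name, both thresholds, and the idea of feeding it probe results are illustrative assumptions rather than part of any ArcGIS tooling, and a real decision would typically also involve human judgment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OutageDetector:
    """Hypothetical helper: flags a disaster only after sustained failures."""

    min_failures: int = 5              # assumed threshold, tune per organization
    min_outage_seconds: float = 600.0  # assumed: 10 minutes of continuous failure
    _failures: int = 0
    _first_failure_at: Optional[float] = None

    def record(self, ok: bool, timestamp: float) -> bool:
        """Record one health-check result; return True if DR should be initiated."""
        if ok:
            # Any successful probe resets the outage window.
            self._failures = 0
            self._first_failure_at = None
            return False
        if self._first_failure_at is None:
            self._first_failure_at = timestamp
        self._failures += 1
        outage_length = timestamp - self._first_failure_at
        return (
            self._failures >= self.min_failures
            and outage_length >= self.min_outage_seconds
        )
```

With `min_failures=3` and `min_outage_seconds=120`, a single failed probe, or two failures followed by a success, would not trigger a failover; three consecutive failures spanning at least two minutes would.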

Two other key definitions in the DR process are the organization’s Recovery Point Objective (RPO) and Recovery Time Objective (RTO). RPO equates to “how much data do we lose in a disaster scenario” and relates to the type of backups that are created and the frequency with which data or configurations are backed up. An RPO of four hours would mean that at worst, four hours of data or work is lost when a DR-level outage occurs, and the organization is comfortable with potentially losing those four hours of data or inputs in a DR process. While a low RPO seems like an obvious approach, in most systems the complexity of backups and potential to interrupt users means that a low RPO has too high a cost in resources, storage, or user impact.

RTO means “how long until we are back up again” and refers to the time taken to restore backups to a DR system, create a new system from scratch, or automatically deploy resources to recover from a disaster. RTO is often lower than RPO, as a failover system, or components of one, might already exist and the ability to switch to that system only requires a DNS or networking change. Defining and achieving a realistic RPO and RTO are both important parts of building a DR strategy, and of getting agreement from stakeholders as to the cost and workflow implications of implementing this strategy.
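As a rough illustration of how a backup schedule relates to these objectives, the sketch below uses a simplified model in which a backup captures the system state at the moment it starts. Under that assumption, a failure just before a backup finishes forces a restore from the previous backup, so the worst-case loss is the backup interval plus the backup run time. The function names and all durations are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Worst-case loss when a backup captures state at its start time.

    If the system fails just before a running backup completes, the newest
    usable backup was started one interval plus one run time earlier.
    """
    return interval + backup_duration


def estimated_recovery_time(*steps: timedelta) -> timedelta:
    """Sum of sequential recovery steps (detect, rebuild, restore, switch traffic)."""
    return sum(steps, timedelta())


# Assumed figures: backups every 3 hours, each taking 45 minutes to complete.
rpo_target = timedelta(hours=4)
loss = worst_case_data_loss(timedelta(hours=3), timedelta(minutes=45))
meets_rpo = loss <= rpo_target

# Assumed recovery steps: detection, rebuild, restore, DNS switch.
recovery = estimated_recovery_time(
    timedelta(minutes=15),
    timedelta(hours=1),
    timedelta(minutes=90),
    timedelta(minutes=5),
)
```

Here the worst-case loss is 3 hours 45 minutes, which fits a four-hour RPO, and the estimated recovery time is 2 hours 50 minutes, which can then be compared against the agreed RTO.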

DR approaches

One common approach to responding to a disaster scenario is to switch traffic to a failover system, which exists to take on traffic when a primary system identifies or experiences issues. This approach has several important requirements that have cost and workflow implications:

The target system must be always-on, or quickly ready for requests. If a low RTO is desired, a system may be kept running at full capacity at all times (sometimes called a hot standby) so that traffic can be directed to it at any time.
The system must be kept up to date. As the live system changes and new data are created or submitted, backups must be taken and regularly applied to the failover system, or automatic replication must occur, so that the RPO meets desired constraints when traffic is switched to the system.
The environment must have the proper IT functionality and controls to first detect the outage, then to make the switch to the other system, through a load balancer, global DNS resolution change, or other process. If the users of a system access it through a load balancer, and the load balancer experiences a disastrous outage, the existence of the failover system is meaningless as it cannot be functionally accessed or used to restore user access.

Another DR approach is to use backups to rebuild a system, which might result in a longer RTO, but reduces ongoing cost as a failover system is not maintained at all times. This may involve taking VM backups or an application-level backup, and then rebuilding the system, either automatically through infrastructure-as-code and software deployment automation, or manually with an experienced team, before the backup is restored, traffic is switched back to the system and users can resume workloads in the system.

DR processes can also have broader implications or approaches that are not unique to the enterprise system built with ArcGIS. For example, an entire data center may have a DR strategy that uses VM replication to maintain a set of VMs in a separate data center, and when an outage with a storage or hypervisor component is identified, all users might switch to access the secondary data center.

DR processes also need to consider what happens after a recovery. If there is a primary system or data center, and a DR event leads to a failover to a secondary system, is the standard approach to switch traffic back to the primary systems after the outage is resolved? Or does the standby system now become the primary, and the original system is set up as a standby to support the next DR situation? Both scenarios have value, and defining what happens after the incident is resolved is an important part of completing the DR strategy and approach.

Backups with ArcGIS Enterprise

In ArcGIS systems, there are several approaches to initiating a backup process that are built into the ArcGIS software. Single-component backups, which only back up the contents and state of a single application, are available for ArcGIS Enterprise components including Portal for ArcGIS, ArcGIS Server, and the ArcGIS Data Store. These component-specific backups are not primarily intended for DR purposes, but are intended to back up and restore in place. For DR purposes, the WebGISDR tool is recommended. Each component has a different backup approach, including:

A portal “site export” backs up the state of the portal application, along with content, users, and groups, sharing settings and relevant system information. This backup is created as a .portalsite backup file and run from the Portal Administrator API. This backup is an all-in-one process where each backup contains the full contents of the system and can take some time to create when there are many users or a large amount of data in a system. The portal backup also includes all configurations made to applications hosted in the portal, such as Experience Builder apps or Dashboards. For more information see Export Site.
ArcGIS Server site exports are created from the Server Administrator API and create an export file that contains all relevant service configurations and any data that was copied to the server during publishing. This backup, as described in Export Site, does not include tilecache or other datasets that are accessed by services published in a “by reference” publishing configuration.
The ArcGIS Data Store has several built-in backup methods that are used to back up the configuration, including connectivity to the ArcGIS Server site, and the content, including rows or content in the relational, tilecache, graph or spatiotemporal big data store. The object store has no backup capability. These backups are created by running a backup utility in the ArcGIS Data Store install directory.
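The portal and server site exports described above are both requested through their administrator APIs. The following is a minimal sketch that only builds the HTTP request and does not send it. The host names, backup paths, and token placeholder are hypothetical, and the `exportSite` operation name and its `location`, `f`, and `token` parameters are assumptions that should be confirmed against the Export Site documentation for the deployed version.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_export_site_request(admin_url: str, location: str, token: str) -> Request:
    """Build (but do not send) a POST to an administrator API exportSite operation."""
    body = urlencode({"location": location, "f": "json", "token": token})
    return Request(
        url=admin_url.rstrip("/") + "/exportSite",
        data=body.encode("utf-8"),
        method="POST",
    )


# Hypothetical hosts, paths, and token; replace with values for your deployment.
portal_req = build_export_site_request(
    "https://portal.example.com/portal/portaladmin", "/backups/portal", "<token>"
)
server_req = build_export_site_request(
    "https://gis.example.com/server/admin", "/backups/server", "<token>"
)
# urllib.request.urlopen(portal_req) would submit the export; it is omitted here.
```

Keeping request construction separate from submission makes it easier to review the target URL and backup location before an export, which can take some time on a large site, is actually started.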

Several ArcGIS Server roles are generally stateless, where all relevant configurations are stored as portal items and generally do not need to be backed up. This includes Notebook Server, Business Analyst Enterprise Server (GeoEnrichment Server), and Raster Analytics Server. Each of these components can generally be re-created from a blank machine or installed image instead of restoring the specific deployment, as the content is stored in ArcGIS Enterprise. The only consideration beyond this would be scenarios where custom code or third-party libraries have been deployed to the system to enable a Python-based workflow or other process.

Beyond component-specific backup functionality, ArcGIS Enterprise also has a multi-component backup tool called the Web GIS Disaster Recovery tool (WebGISDR for short) which can back up the state of several different ArcGIS Enterprise components at the same time.

The WebGISDR tool is described fully in the ArcGIS Enterprise documentation, but it is important to note that this tool prioritizes collective application consistency and state over speed of backup or flexibility of the process. This means in practice that the tool attempts to capture the state of the system exactly at the time when the request was run. If further content or data is published while the backup is occurring, it will not appear in any of the backup files and will not be restorable to a failover system. This focus is comparable to prioritizing RPO over RTO: the functional completeness of the restored system takes precedence over how quickly it can be restored, since a faster but less consistent process could lead to misconfigurations or bad data.
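For scheduled backups, the tool is typically invoked from a script. The sketch below only assembles the command line and does not run it. The install and properties file paths are hypothetical, and the `--export`, `--import`, and `--file` flags reflect the usage described in the ArcGIS Enterprise documentation as understood here, so verify them for your version before relying on this.

```python
from typing import List


def webgisdr_command(tool_path: str, properties_file: str, mode: str = "export") -> List[str]:
    """Return the argument list for a WebGISDR run without executing it."""
    if mode not in ("export", "import"):
        raise ValueError("mode must be 'export' or 'import'")
    return [tool_path, f"--{mode}", "--file", properties_file]


# Hypothetical paths; adjust to the actual install and properties file locations.
cmd = webgisdr_command(
    "/opt/arcgis/portal/tools/webgisdr/webgisdr.sh",
    "/etc/arcgis/webgisdr.properties",
)
# subprocess.run(cmd, check=True) would start the backup; omitted in this sketch.
```

Because the tool captures state at the time the request is run, a scheduler would usually call this during off hours so that little content is published while the export is in progress.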

Using external tools and methods

Beyond ArcGIS application-specific backups, there are a variety of other methods that may be suggested or preferred by an organization, which may be native to a specific component or an infrastructure provider. These methods may be suitable for an ArcGIS backup or disaster recovery process but should be carefully assessed and considered as their use can also be disruptive to a system.

For example, VM snapshots are a common approach to backing up a system but can introduce complicated challenges if the method for the snapshot includes a sudden machine restart, or captures data state only, as in-process operations or configurations might not be captured, or might be only partially captured, which could lead to an unexpected and corrupted state when a restore or recovery occurs.

Note:

VM-based backup strategies sometimes move VM resources between two data centers to prevent an outage. In these scenarios, ensure that the ArcGIS Server and ArcGIS Pro hosts are accessing a database in their own data center, not making requests to the original data center, as this will introduce latency that will negatively affect user experience.

Cloud-based backup and DR tools, such as Microsoft Azure Site Recovery, can be compatible with ArcGIS Enterprise systems when they are carefully planned so that DNS resolution, database connectivity, and client connectivity to the system are all maintained in the case of a site recovery operation. These backup approaches operate at a relatively low virtual machine level of access, so they do not provide guarantees of application consistency. This means that while the recovery system will often be successfully restored and operated, in some cases application-level inconsistencies could occur, for example, with a publishing process that is underway or an edit to a feature service that is made during the backup process. There are ways to plan for this, such as taking VM snapshots during off-peak periods, but in general the guidance to “plan, test and carefully consider implications” applies to these external tool approaches.

Database providers that build the relational databases that ArcGIS works with (either as an enterprise geodatabase or the source of a query layer) provide their own database-specific backup options, which usually can create a file-based backup of the database contents and configurations for restoration to a new system or deployment when needed.

ArcGIS Online backup considerations

ArcGIS Online, as a SaaS offering and system, needs to be approached differently from a backup and recovery perspective. As part of the system stability, Esri handles backup and recovery requirements for hardware and system-level outages, without user input, and the Service Level Agreement for ArcGIS Online reflects this commitment. ArcGIS Online does not currently provide a method for users to create organization-wide backups or content backups, and organizations will need to define a strategy for their own, additional backups of content or configurations in ArcGIS Online through a variety of patterns, including:

Creating file-based offline backups of data by exporting ArcGIS Online services to various formats
Using distributed collaboration to copy content to an ArcGIS Enterprise deployment or another ArcGIS Online subscription
Using the ArcGIS API for Python or other tools to extract and back up application configurations, files, options or other content to a file system or other backup system like a Git repository
Using the “draft” mode offered by both Experience Builder and Sites (with ArcGIS Hub or Enterprise Sites), which allows you to preview the results of a configuration, and even review it with testers, before publishing the draft to the final version
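The scripted pattern in the list above can be sketched as follows. This is a minimal example that writes each item's JSON configuration to a folder, which could then be committed to a Git repository. The function name and output layout are hypothetical. The items are expected to expose `id`, `title`, `type`, and `get_data()`, as `arcgis.gis.Item` objects do, and the commented usage assumes the ArcGIS API for Python is installed and that the placeholder credentials are replaced.

```python
import json
from pathlib import Path
from typing import Iterable, List


def backup_item_configs(items: Iterable, out_dir: Path) -> List[Path]:
    """Write each item's JSON configuration to out_dir as <item id>.json."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for item in items:
        data = item.get_data()
        if not isinstance(data, dict):
            continue  # skip items whose data is a file path or is empty
        record = {"id": item.id, "title": item.title, "type": item.type, "data": data}
        path = out_dir / f"{item.id}.json"
        path.write_text(json.dumps(record, indent=2, sort_keys=True), encoding="utf-8")
        written.append(path)
    return written


# Assumed usage with the ArcGIS API for Python (not executed here):
#   from arcgis.gis import GIS
#   gis = GIS("https://www.arcgis.com", "username", "password")
#   items = gis.content.search("owner:username", item_type="Dashboard", max_items=100)
#   backup_item_configs(items, Path("agol_backup"))
```

This captures application configurations only. The underlying hosted data would still need one of the export or collaboration patterns listed above.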

Note that Esri Partners have also created backup solutions which can be reviewed or purchased through the ArcGIS Marketplace.

ArcGIS Online recently introduced the Recycle Bin capability, which will store deleted content for a period of 14 days (by default) before permanent deletion. Content in the Recycle Bin can be restored to its previous status and location with a simple workflow. This will aid in preventing the accidental loss of content that is interlinked with other content, but is not clearly identified as being reliant on, or relied on by, that other content.

For foundational hosted feature service data, consider using the Change Tracking capability to store previous row contents as edits are made. These previous versions can be accessed through the Extract Changes operation.