Every IT system serves a purpose. To fulfill that purpose, it needs to be available for use. Some IT systems are critical to a business and hence need to be highly available, with no or minimal periods of partial or complete unavailability. Other systems are less vital, and a certain amount of scheduled or unscheduled downtime is acceptable, for example, if users can fall back on alternative workflows or simply wait until the system becomes available again. Many systems fall somewhere between these two extremes.
High availability (HA) is a design approach that enables a system to meet a prearranged level of operational performance over a specific period of time. Highly available systems provide customers with a reliable system and environment to meet or exceed their business requirements for service delivery and perform at an expected level of quality.
High availability, while related to disaster recovery (DR), is a separate concept. Generally, HA is focused on avoiding unplanned downtime for service delivery, whereas DR is focused on retaining the data and resources needed to restore a system to a previous acceptable state after a disaster. When DR plans are implemented, it is typical for service delivery to be disrupted until the system has been restored. See Backups and disaster recovery for more information.
Another commonly used term in this space is geographic redundancy, which generally refers to the design goal of having an application or system survive a full data center outage by having additional systems or a backup system available in a different geographic location. This approach can help protect against natural disasters, power outages, or other disruptions to data center availability.
Many architects use a common set of terms to describe a system- or component-level approach to high availability. The most common terms used in this area include:
One metric by which availability can be measured is uptime, which is generally measured as the percentage of time a system has been “available” over a certain period. The definition of available is subjective and should be set early in the system design process so that shared agreement on this target can be reached. A desired level of availability is frequently defined as a targeted uptime, often expressed in terms of nines. For example:
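As a rough illustration, a "nines" uptime target translates directly into a downtime budget. The helper below is a hypothetical sketch (the function name is illustrative, not part of any product), assuming a non-leap year:

```python
# Illustrative only: convert an uptime target expressed as a percentage
# into the maximum allowed downtime per (non-leap) year.
def max_downtime_per_year(uptime_percent: float) -> float:
    """Return the allowed downtime in minutes per year for a given uptime target."""
    minutes_per_year = 365 * 24 * 60  # 525,600 minutes in a non-leap year
    return minutes_per_year * (1 - uptime_percent / 100)

# Three nines (99.9%) allows roughly 8.8 hours of downtime per year,
# while five nines (99.999%) allows only about 5 minutes.
budgets = {target: max_downtime_per_year(target)
           for target in (99.0, 99.9, 99.99, 99.999)}
```

Each additional nine shrinks the downtime budget by a factor of ten, which is one reason higher targets become disproportionately expensive to meet.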
Availability targets can be formalized in a service level agreement (SLA) between the users of a system and the organization operating that system. SLAs often include other performance-related metrics beyond pure availability targets, such as expected response times, and can define penalties for failing to meet those targets where there is a vendor and customer relationship. Internal SLAs are equally important, though they generally do not include the penalty and reporting requirements of a customer-facing SLA.
Another approach that organizations take related to availability is to establish criticality tiers for the systems they maintain, ranging from non-essential to foundational, depending on the impact that an outage may have on the organization. Considerations may include user experience, financial, reputational, and regulatory impact, and each criticality tier may have a different target SLA definition. Some organizations might refer to a certain system as “Tier 1” or “Business Critical,” while other systems are “Tier 2” or less business critical and thus have fewer constraints or different configurations.
Designing and building a system to meet a pre-defined level of availability requires a holistic approach that considers many different aspects or topic areas, including:
Building a system that accommodates higher uptime demands typically requires a significant upfront and ongoing investment of time and resources when compared to a baseline that meets just a standard level of availability. However, high availability is not an all-or-nothing proposition, and it is often useful to consider if there are sub-systems for which availability targets can be relaxed without significantly impacting the business value of an IT system.
The process of designing a highly available system does not start with a blank canvas. In most cases, an organization’s existing IT infrastructure, policies, expertise, and preferences will determine the overall framework that an enterprise GIS system needs to accommodate. This includes the uptime or availability expectations of supporting systems and which IT components are available to assist with achieving a high level of availability. Consider interdependencies between decisions, where one design decision often begets another. Many of these details can be thought of as design constraints, which help to guide a design process towards a mutually agreeable destination, where the system meets overall requirements while aligning to standards already set by the organization and balancing cost, manageability, and other factors.
Often, design constraints fall into these categories:
Similarly, IT organizations may further limit the range of infrastructure choices, for example, to specific makes and models of physical hardware, virtualization layers, storage systems, load balancers, reverse proxies, and so on.
Leveraging commercial cloud-based Infrastructure as a Service (IaaS), whether virtual machines or Kubernetes clusters, also constrains your options.
With respect to ArcGIS Enterprise, high availability refers to measures that increase the availability of a single deployment of ArcGIS Enterprise. Replicated deployments, normally geographically distributed in another data center or in another cloud region, provide a disaster recovery capability. Learn about High availability in ArcGIS Enterprise.
ArcGIS Enterprise provides higher levels of availability through the combination of multiple machines in different configurations. The components of ArcGIS Enterprise use different approaches to achieving high availability:
A highly available portal site consists of two servers that are joined together to create the HA site. They are each fully redundant, but the system maintains one machine as the primary node and the other machine is the standby node. If the primary machine fails, the standby machine will detect the failure and promote itself to become the primary.
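The failure detection and promotion described above can be sketched roughly as follows. This is a minimal illustration of the active-passive pattern; the class, attribute, and function names are hypothetical and not part of the Portal for ArcGIS API:

```python
# Illustrative sketch of active-passive failover: the standby monitors the
# primary's health and promotes itself when the primary fails.
class PortalNode:
    def __init__(self, name: str, role: str):
        self.name = name
        self.role = role        # "primary" or "standby"
        self.healthy = True

def failover_if_needed(primary: PortalNode, standby: PortalNode) -> PortalNode:
    """Return the node that should act as primary after a health check."""
    if primary.healthy:
        return primary
    standby.role = "primary"    # standby detects the failure and promotes itself
    return standby

primary = PortalNode("portal-node-1", "primary")
standby = PortalNode("portal-node-2", "standby")
primary.healthy = False          # simulate a primary failure
active = failover_if_needed(primary, standby)
```

In the real product, health monitoring and promotion happen automatically between the two joined machines; the sketch only captures the decision logic.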
At the web server level, the system is active-active, as each portal node is able to service incoming requests and the search indices are kept in sync across both systems. However, only one node handles state changes, where edits, member invites, and configurations are saved to the portal database, so the overall system is considered active-passive.
A highly available portal also requires a load balancer to distribute requests between the two nodes, usually in a round-robin fashion. The primary and standby nodes share state through inter-machine communication via ports and database synchronization, but also rely on shared file storage for the portal’s content directory, which can be an NFS file share, a UNC file share, or cloud-native object storage.
Read more about configuring a highly available Portal for ArcGIS deployment.
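The round-robin distribution that the load balancer performs can be illustrated with a minimal sketch. Node names are hypothetical, and a real load balancer would also perform health checks before routing:

```python
import itertools

# Minimal sketch of round-robin request routing across two portal nodes.
nodes = ["portal-node-1", "portal-node-2"]
rr = itertools.cycle(nodes)

def route_request() -> str:
    """Return the node that should receive the next incoming request."""
    return next(rr)

# Successive requests alternate between the two nodes:
first_four = [route_request() for _ in range(4)]
```

Because both nodes can serve read requests, this simple rotation spreads load evenly; write operations are still funneled to the primary node as described above.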
A highly available GIS server site consists of two or more fully redundant machines joined together into an ArcGIS Server “site,” with workloads load-balanced across all nodes in an active-active configuration. A highly available GIS server site also requires a load balancer to route requests to the member machines, usually with a round-robin approach, though web traffic can also be routed in a primary-standby manner.
The machines in a site share state primarily through a shared storage location for the server directories and the configuration store, usually an NFS-type or UNC-based file share. For cloud systems, cloud-native options are also available for the configuration store, such as DynamoDB and S3 storage in AWS or Azure Files storage in Microsoft Azure.
It is worth noting that some specialized GIS server roles, such as GeoEvent Server, cannot be configured to run in a multi-machine site. As a result, special considerations apply to achieve higher levels of availability for those GIS server roles.
Read more about configuring a multi-machine site to achieve a high-availability deployment of ArcGIS Server. Resources for single-machine high availability deployments are also available.
The Web Adaptor can be deployed redundantly across two or more machines, with each instance being fully redundant in an active-active configuration. This configuration requires a front-end load balancer that clients send requests to, which distributes requests across both web adaptor hosts. Further resources are available in the documentation.
Relational data store and graph data store: A highly available relational data store or graph data store consists of exactly two fully redundant instances in an active-passive configuration. If the primary data store machine fails, the standby machine will detect the failure and promote itself to become the primary.
Tile cache data store: The tile cache data store supports two different high availability configurations: either a two-node configuration with a primary machine and a fully redundant standby machine (active-passive), or a cluster with an odd number of nodes, at least three (active-active). In clustered mode, each piece of data is stored redundantly on at least two machines, although no single machine holds a copy of all the data. Even if one machine is lost, the data it hosted remains available on at least one other machine.
Spatiotemporal big data store and object store: These types of data stores also support cluster modes. Clusters should contain an odd number of machines (required for reaching consensus among the members) and a minimum of three machines. These configurations are all active-active high availability configurations.
The documentation topic Add machines to a data store provides additional guidance, steps, and recommendations.
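The odd-number requirement for clustered data stores follows from quorum-based consensus: a cluster can make progress only while a strict majority of its members is reachable, and an even-sized cluster tolerates no more failures than the odd-sized cluster one machine smaller. A minimal sketch of the arithmetic (function names are illustrative):

```python
# Quorum math behind odd-sized clusters. Illustrative only.
def has_quorum(total_nodes: int, alive_nodes: int) -> bool:
    """A cluster can keep operating only while a strict majority is alive."""
    return alive_nodes > total_nodes // 2

def failures_tolerated(total_nodes: int) -> int:
    """Maximum machine failures the cluster survives with quorum intact."""
    return (total_nodes - 1) // 2

# A 3-node and a 4-node cluster both tolerate only one failure,
# so the fourth machine adds cost without adding resilience.
tolerances = {n: failures_tolerated(n) for n in (3, 4, 5)}
```

This is why the documentation recommends growing such clusters from three machines to five, rather than to four.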
The availability of database resources is a specialized field of architecture with many provider-specific options for each database offering, including both active-active and active-passive patterns. In general, ArcGIS can connect to these configurations when the data registration process, by which services are accessed or published, uses a DNS alias or floating IP address. ArcGIS always accesses that same alias, but the alias can be repointed to a different backend database if there is an outage of the primary system. In this scenario, the ArcGIS components are unaware of the change in backend database and continue to work as expected, assuming the same credentials, schema, and rows are available.
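The DNS-alias pattern described above can be sketched as follows. The hostnames and the in-memory resolve table are hypothetical stand-ins for a real DNS record; the point is that the client-side connection string never changes:

```python
# Illustrative sketch of database failover behind a stable DNS alias.
# ArcGIS connects to the alias; operations repoint it during an outage.
dns_table = {"gisdb.example.com": "db-primary.internal"}

def resolve(alias: str) -> str:
    """Stand-in for a DNS lookup against the alias record."""
    return dns_table[alias]

def connect(alias: str) -> str:
    """Stand-in for a database connection via the registered alias."""
    return f"connected to {resolve(alias)}"

before_failover = connect("gisdb.example.com")

# During a primary outage, operations repoint the alias; the client-side
# registration (and therefore ArcGIS) is unchanged:
dns_table["gisdb.example.com"] = "db-standby.internal"
after_failover = connect("gisdb.example.com")
```

Because the registration references the alias rather than a specific host, no reconfiguration of ArcGIS is needed when the backend changes, provided the standby presents the same credentials, schema, and data.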
To make educated and effective choices related to high availability, consider these design recommendations: