High availability

Every IT system serves a purpose. To fulfill that purpose, it needs to be available for use. Some IT systems are critical to a business and hence need to be highly available, with no or minimal periods of being partially or wholly unavailable. Other systems are less vital, and a certain amount of scheduled or unscheduled downtime is acceptable, for example, if users can fall back on alternative workflows or simply wait until the system becomes available again. Many systems fall somewhere between these two extremes.

Define and understand availability

High availability (HA) is a design approach that enables a system to meet a prearranged level of operational performance over a specific period of time. Highly available systems provide customers with a reliable system and environment to meet or exceed their business requirements for service delivery and perform at an expected level of quality.

Note:

High availability, while related to disaster recovery (DR), is a separate concept. Generally, HA is focused on avoiding unplanned downtime for service delivery, whereas DR is focused on retaining the data and resources needed to restore a system to a previous acceptable state after a disaster. When DR plans are implemented, it is typical for service delivery to be disrupted until the system has been restored. See Backups and disaster recovery for more information.

Another commonly used term in this space is geographic redundancy, which generally refers to the design goal of having an application or system that can survive a full data center outage by having additional systems or a backup system available in a different geographic location. This approach can help protect against natural disasters, power outages, or other disruptions to data center availability.

Many architects use a common set of terms to refer to and describe in detail a system- or component-level approach to high availability. The most common terms used in this area include:

  • Active-active is a fault tolerance pattern that relies on two or more systems actively receiving and processing requests. In an active-active system, each node that receives requests is equal to the others, and requests are usually load-balanced so that they are sent in roughly equal proportion to each backend node.
  • Active-passive, also referred to as primary/standby, is a fault tolerance pattern in which user requests flow entirely to one system while a second system stands by until it is needed, either because the primary system fails (stops processing requests) or because of a scheduled or planned switch. In either case, user traffic is redirected to the passive system, which then becomes the active system.
  • Failover systems are kept fully in sync with an active system and are ready to receive traffic, but only do so if the primary system has failed for some reason. A failover system is similar to an active-passive configuration but may have different workflows associated with “failing over” to that system. A minimal routing sketch for these patterns follows this list.
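
The following Python sketch illustrates the routing logic behind an active-passive or failover pattern: traffic goes to the primary node while it answers a health check and is redirected to the standby otherwise. The hostnames and the /health endpoint are assumptions for illustration; production systems normally delegate this logic to a load balancer or cluster manager rather than application code.

```python
# Minimal active-passive routing sketch. The node URLs and the /health
# endpoint are hypothetical; a real deployment would use a load balancer
# or cluster manager for this decision.
import urllib.request

PRIMARY = "https://primary.example.com/health"   # hypothetical active node
STANDBY = "https://standby.example.com/health"   # hypothetical passive node


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the node answers its health endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def choose_active_node() -> str:
    """Send traffic to the primary while it is healthy; otherwise fail over."""
    if is_healthy(PRIMARY):
        return PRIMARY
    if is_healthy(STANDBY):
        return STANDBY  # the standby is promoted and now receives traffic
    raise RuntimeError("No healthy node available")


if __name__ == "__main__":
    print("Routing traffic to:", choose_active_node())
```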

Availability targets

One metric by which availability can be measured is uptime, which is generally measured as the percentage of time a system has been “available” over a certain period. The definition of available is subjective and should be set early in the system design process so that shared agreement on this target can be reached. A desired level of availability is frequently defined as a targeted uptime, often expressed in terms of nines. For example:

  • 99% (two nines) - is equivalent to 3.65 days of allowed downtime per year
  • 99.9% (three nines) - is equivalent to 8.77 hours of allowed downtime per year
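
These downtime allowances follow directly from the target percentage. A minimal Python sketch of the arithmetic, assuming a 365.25-day year:

```python
# Convert an uptime target ("nines") into allowed downtime per year,
# assuming a 365.25-day year.
HOURS_PER_YEAR = 365.25 * 24  # 8,766 hours


def allowed_downtime_hours(availability_percent: float) -> float:
    """Allowed downtime per year, in hours, for a given uptime target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% uptime allows {allowed_downtime_hours(target):.2f} hours of downtime per year")
# 99.0%  -> 87.66 hours (~3.65 days)
# 99.9%  -> 8.77 hours
# 99.99% -> 0.88 hours
```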

Availability targets can be formalized in the form of a service level agreement (SLA) between the users of a system and the organization operating that system. SLAs often include other performance-related metrics beyond pure availability targets, such as expected response times, and can define penalties for failing to meet those targets if there is a vendor and customer relationship. Internal SLAs are equally important, though they generally do not include the penalty and reporting requirements of a customer-facing SLA.

Criticality tiers

Another approach that organizations take related to availability is to establish criticality tiers for the systems that they maintain, ranging from non-essential to foundational, depending on the impact that an outage may have on the organization. Considerations may include user experience, financial, reputational, and regulatory impacts, and each criticality tier may have a different target SLA definition. Some organizations might refer to a certain system as “Tier 1” or “Business Critical,” while other systems are “Tier 2” or less business critical and thus have fewer constraints or different configurations.

Designing and building a system to meet a pre-defined level of availability requires a holistic approach that considers many different aspects or topic areas, including:

  • Carefully selecting higher-grade software and hardware components that have been specifically designed to maximize the mean time between failures (MTBF), as opposed to using commodity components. The relationship between MTBF, repair time, and availability is sketched after this list.
  • Eliminating any single point(s) of failure in the system by providing redundancy of all components. A single point of failure is a piece of software or hardware that, when it fails, causes the entire system to become unavailable. Without a single point of failure, a system can tolerate the failure of any one component without noticeably impacting availability.
  • Establishing plans for restoring a system should it become unavailable, and testing those plans. This might include defining targets for the acceptable amount of time it takes to make the system available again and for how much data loss is allowable.
  • Enforcing policies and procedures with a focus on change management to minimize the chances of accidental or unintended interruptions, for example, due to human error.
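
As a reference point for the first consideration, the steady-state availability of a single component can be estimated from its mean time between failures and its mean time to repair (MTTR) using the standard relationship availability = MTBF / (MTBF + MTTR). A small sketch with illustrative numbers:

```python
# Estimate steady-state availability from reliability figures using the
# standard relationship availability = MTBF / (MTBF + MTTR).
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single component is expected to be operational."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative values: a component that fails on average every 10,000 hours
# and takes 4 hours to repair is available about 99.96% of the time.
print(f"{availability(10_000, 4) * 100:.2f}%")
```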

Building a system that accommodates higher uptime demands typically requires a significant upfront and ongoing investment of time and resources when compared to a baseline that meets just a standard level of availability. However, high availability is not an all-or-nothing proposition, and it is often useful to consider if there are sub-systems for which availability targets can be relaxed without significantly impacting the business value of an IT system. 

Design considerations

The process of designing a highly available system does not start with a blank canvas. In most cases, an organization’s existing IT infrastructure, policies, expertise, and preferences will determine the overall framework that an enterprise GIS system needs to accommodate. This includes the uptime or availability expectations of supporting systems and which IT components are available to assist with achieving a high level of availability. Consider interdependencies between decisions, where one design decision often begets another. Many of these details can be thought of as design constraints, which help to guide a design process towards a mutually agreeable destination, where the system meets overall requirements while aligning to standards already set by the organization and balancing cost, manageability, and other factors.

Often, design constraints fall into these categories:

  • Business needs: An organization’s business requirements determine what amount of downtime is acceptable, ranging from zero downtime to hours or days of downtime before the system is restored. This is called the Recovery Time Objective (RTO). Business requirements also indicate how much data loss, expressed in time, can be tolerated in case of a failure. This is called the Recovery Point Objective (RPO) and typically ranges from zero seconds to a week’s worth of data loss. A simplified sketch relating these targets to a backup schedule follows this list.
  • Deployment pattern: Choosing a given deployment pattern will often pre-determine the degree to which availability considerations must be factored into detailed design decisions. In other words, when building a system around SaaS or PaaS offerings, many of those decisions have already been made by the organization that is hosting the offering. On the other hand, deploying GIS server software to a data center that your organization owns and manages provides the highest level of flexibility with respect to meeting your exact availability requirements but also comes with the most responsibilities.
  • Infrastructure: In most cases, IT professionals who design and build GIS systems for deployment into data centers that their organization operates do not need to concern themselves with basic physical infrastructure, such as hosting facilities, power, cooling, and networking, because these are already established and usually provide high levels of availability. IT organizations may, however, limit the range of infrastructure choices further, for example, to specific makes and models of physical hardware, virtualization layers, storage systems, load balancers, reverse proxies, and so on. Leveraging commercial cloud-based Infrastructure as a Service (IaaS), whether virtual machines or Kubernetes clusters, likewise constrains your options.
  • Maintenance: For some systems, the ability to patch or update components without downtime is critical to ensure that users are not impacted and that dependent systems can continue to function. In these cases, patching on a rolling basis or using a blue-green or primary/standby environment may be consistent with those goals, but the potential impact of planned maintenance actions and the maintenance cadence is important to consider in any system design.
  • Software: The levels to which the software components that make up a system support higher degrees of availability vary, ranging from no designated support to fully supported and documented high availability configurations. They also differ in degree: not all levels of availability can be achieved with a given piece of software, which in turn limits the SLA, RPO, or RTO that can be achieved.
  • People and Processes: It is often preferable to leverage the processes and procedures for building and managing highly available systems that your organization has already established, to benefit from existing expertise.
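
To make the business-needs constraint concrete, the following sketch checks a hypothetical backup schedule against RPO and RTO targets. All values are illustrative, and the worst-case math is simplified (failure detection and decision time are ignored).

```python
# Relate a hypothetical backup schedule to RPO/RTO targets. The numbers are
# illustrative; real targets come from the business requirements described
# above, and real recovery timelines include detection and decision time.
from dataclasses import dataclass


@dataclass
class RecoveryTargets:
    rpo_minutes: float  # maximum tolerable data loss, expressed in time
    rto_minutes: float  # maximum tolerable time until service is restored


def meets_targets(backup_interval_minutes: float,
                  restore_duration_minutes: float,
                  targets: RecoveryTargets) -> bool:
    """Worst-case data loss equals the backup interval; worst-case outage is
    approximated here by the restore duration alone."""
    return (backup_interval_minutes <= targets.rpo_minutes
            and restore_duration_minutes <= targets.rto_minutes)


targets = RecoveryTargets(rpo_minutes=60, rto_minutes=240)  # 1 h RPO, 4 h RTO
print(meets_targets(backup_interval_minutes=30,
                    restore_duration_minutes=180,
                    targets=targets))  # True: 30 min <= 60 and 180 min <= 240
```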

High availability patterns

With respect to ArcGIS Enterprise, high availability refers to measures that increase the availability of a single deployment of ArcGIS Enterprise. Replicated deployments, normally geographically distributed in another data center or in another cloud region, provide a disaster recovery capability.

ArcGIS Enterprise provides higher levels of availability through the combination of multiple machines in different configurations. The components of ArcGIS Enterprise use different approaches to achieving high availability:

  • Portal for ArcGIS: A highly available portal site consists of two servers that are joined together to create the HA site. They are each fully redundant, but the system maintains one machine as the primary node and the other machine is the standby node. If the primary machine fails, the standby machine will detect the failure and promote itself to become the primary. At the web server level, the system is active-active, as each portal node is able to service incoming requests and the search indices are kept in sync across both systems. However, only one node handles state changes, where edits, member invites, and configurations are saved to the portal database, so the overall system is considered active-passive. A highly available portal also requires a load balancer to distribute requests between the two nodes, usually in a round-robin fashion. The primary and standby nodes share state through inter-machine communication via ports and database synchronization, but also rely on shared file storage for the portal’s content directory, which can be an NFS file share, a UNC file share, or cloud-native object storage.
  • GIS Server: A highly available GIS server site consists of two or more fully redundant machines that are joined together into an ArcGIS Server “site” in which workloads are load-balanced across all nodes (an active-active configuration). A highly available GIS server site also requires a load balancer to route requests to the member machines, usually in a round-robin manner, though web traffic can also be routed in a primary/standby manner. The machines in a site share state primarily through a shared storage location for the server directories and the configuration store, usually an NFS-type or UNC-based file share. For cloud systems, cloud-native options are also available for the configuration store, such as DynamoDB and S3 storage in AWS or Azure Files storage in Microsoft Azure. A sketch of the health check polling that a load balancer typically performs against these tiers follows this list.
Note:

Some specialized GIS server roles, such as GeoEvent Server, cannot be configured to run in a multi-machine site. As a result, special considerations apply to achieving higher levels of availability for those roles.

  • Web Adaptor: The Web Adaptor can be deployed redundantly across two or more machines, with each instance being fully redundant in an active-active configuration. This configuration requires a front-end load balancer that clients send requests to, which distributes those requests across the Web Adaptor hosts.
  • Relational data store and graph data store: A highly available relational data store or graph data store consists of exactly two fully redundant instances in an active-passive configuration. If the primary data store machine fails, the standby machine will detect the failure and promote itself to become the primary.
  • Tile cache data store: The tile cache data store supports two different high availability configurations: either a two-node configuration with a primary machine and a fully redundant standby machine (active-passive), or a cluster with an odd number of nodes, at least three (active-active). In clustered mode, each piece of data is deployed redundantly across at least two machines, although no single machine holds a copy of all the data. If one machine is lost, the data it hosted remains available on at least one other machine.
  • Spatiotemporal big data store and object store: These types of data stores also support cluster modes. Clusters must contain an odd number of machines (required for consensus among the members) and a minimum of three machines. These are all active-active high availability configurations.
  • Database and enterprise geodatabase: The availability of database resources is a specialized field of architecture in its own right, with many provider-specific options for each database offering, including both active-active and active-passive patterns. In general, ArcGIS can work with these configurations when the data registration process by which services are accessed or published uses a DNS alias or flexible IP that ArcGIS always connects to but that can point to a different backend database if there is an outage of the primary system. In this scenario, the ArcGIS components are unaware of the change in backend database and continue to work as expected, assuming the same credentials, schema, and rows are available.
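
As a rough illustration of how a front-end load balancer decides where to send traffic across these tiers, the sketch below polls health check endpoints on each node and treats an HTTP 200 as healthy. The hostnames are hypothetical, and the paths and ports shown reflect commonly documented defaults for ArcGIS Server and Portal for ArcGIS; verify them against the documentation for your version before relying on them.

```python
# Health check polling of the kind a load balancer performs against highly
# available ArcGIS Enterprise tiers. Hostnames are hypothetical; confirm the
# endpoint paths and ports for your version. Self-signed certificates on
# ports 6443/7443 may require additional trust configuration.
import urllib.request

NODES = {
    "portal-1": "https://portal1.example.com:7443/arcgis/portaladmin/healthCheck?f=json",
    "portal-2": "https://portal2.example.com:7443/arcgis/portaladmin/healthCheck?f=json",
    "server-1": "https://server1.example.com:6443/arcgis/rest/info/healthCheck?f=json",
    "server-2": "https://server2.example.com:6443/arcgis/rest/info/healthCheck?f=json",
}


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Treat an HTTP 200 from the health check endpoint as healthy, as most
    load balancer probes do; response payloads vary by component and version."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


for name, url in NODES.items():
    print(f"{name}: {'healthy' if is_healthy(url) else 'unhealthy'}")
```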

Design recommendations

To make educated and effective choices related to high availability, consider these design recommendations:

  • Backups: Creating backups of your ArcGIS deployment is the easiest way to avoid data loss and reduce downtime. Backups allow you to recover items that existed at the time the backup was created and significantly reduce the time needed to return to operation.
  • Managing read-only and read/write systems: If your application or system is a read-only system, where users primarily view data that is managed elsewhere, achieving a high availability configuration is simpler because an active-active configuration can be used. If users create edits, in a read/write system, there is usually a mix of active-active and active-passive tiers in the architecture, which allows maintaining a primary, active database of record while also keeping a passive, standby, or failover system ready to receive traffic in case of an outage.
  • Duplication and redundancy: Implement multiple instances of specific system components, potentially including geographic redundancy, to reduce single points of failure. Consider the skills required to maintain the system. Consider that a person can just as easily be a single point of failure.
  • Test plans and system monitoring: Evaluate the system’s ability to meet the required service level by testing stress, performance, and failover behavior. All testing plans and associated activities should be part of your overall system governance. Continually monitor the system so that problems can be corrected before they cause a widespread or unrecoverable outage; a minimal availability calculation against monitoring data is sketched after this list.
  • Other high availability best practices include automation, collaboration, load balancing, publication strategies, and workload separation.
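
A minimal sketch of the monitoring side of this recommendation: compare measured availability over a reporting period against the SLA target. The incident durations and target are hypothetical.

```python
# Compare measured availability against an SLA target using hypothetical
# monitoring data (outage durations observed during the reporting period).
from datetime import timedelta

SLA_TARGET = 99.9            # percent uptime agreed in the SLA
PERIOD = timedelta(days=30)  # reporting period

incidents = [timedelta(minutes=12), timedelta(minutes=25)]  # observed outages

downtime = sum(incidents, timedelta())
measured = 100 * (1 - downtime / PERIOD)

print(f"Measured availability: {measured:.3f}% "
      f"({'meets' if measured >= SLA_TARGET else 'misses'} the {SLA_TARGET}% target)")
```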