Big data analytics system

A big data analytics system is used for analyzing large volumes of geographic and tabular data. Analytic capabilities primarily focus on vector data, but some capabilities do exist for imagery and raster data types. This system pattern leverages Apache Spark as the engine for performing large-scale data analytics in batch on distributed compute infrastructure. Spatial and temporal big data analytic results are typically written back to data stores for further downstream analysis, or to other ArcGIS systems for visualization and further geographic analysis. Functional capabilities depend heavily on the deployment pattern selected.

A big data analytics system pattern delivers value to an organization through various characteristics, such as:

  • Providing an innovative dimension to big data analysis by incorporating geographic science, improving decision-making.
  • Adding geography-based analytics to existing Apache Spark-based big data analysis workflows.
  • Exposing spatial operations to data scientists through familiar tools and experiences.
  • Quickly extracting geographic insights from location-attributed (latitude and longitude) big data such as GPS, AIS, human-movement, or other moving-sensor datasets.
  • Storing and indexing analytic results in systems such as object storage, relational databases, and data warehouses, from which they can be shared and consumed in more intuitive applications such as web maps, story maps, and custom applications.

If you’re new to ArcGIS system patterns, review the introduction first.

User personas and workflows

The user personas who most commonly interact with big data analytics systems, along with the types of workflows and tasks they typically perform using this system, include:

  • Data analyst, scientist, and engineer. Data analysts, scientists, and engineers are the main user personas interacting with a big data analytics system. They are typically familiar with Apache Spark, Python, and working with big data, and these specialized skills are required to maximize value from the spatially enabled big data analytics systems presented here. They work with and prepare big data; design, develop, and conduct analysis routines; and visualize and study analysis results. Their work is typically iterative and often involves describing and sharing analysis results with other stakeholders.
  • GIS analyst. GIS analysts are not usually the primary users of big data analytics systems, because the required skills tend to fall outside the scope of the GIS analyst role. However, GIS analysts commonly work alongside data analysts, scientists, and engineers to ensure that important spatial concepts are understood and that best practices for working with geospatial data and analysis methods and tools are applied.

To get the most value out of a big data analytics system, consider involving both personas above or individuals possessing skills from both personas.

Applications

While there are many applications and experiences provided by ArcGIS, typically big data analytics systems expose only lower-level interfaces familiar to data analysts, data scientists, and data engineers. These interfaces vary depending on the selected deployment pattern. The Apache Spark deployment pattern relies primarily on Python notebooks, typically running within the data analytics environment, on which PySpark Python code is developed and bundled as a job that is submitted to the Spark cluster. The software as a service (SaaS) deployment pattern provides a visual modeling interface that supports configuring workflows by logically connecting data sources with analytic tools.

Additional applications such as reports, dashboards, and interactive mapping applications are often employed for visualizing and sharing analysis results. This is typically accomplished with a self-service mapping, analysis, and sharing system or other ArcGIS system pattern. Learn more about using, integrating, and composing system patterns.

Capabilities

The primary capabilities provided by a big data analytics system are introduced below. Capabilities that are used in big data analytics workflows but typically provided by other systems, such as basemaps and other location services provided by a location services system, are not listed. Learn more about related system patterns.

Note:

Not all capabilities described below are available in all deployment patterns. See selecting a deployment pattern and the deployment pattern pages for more information on how these capabilities apply (or don’t apply) in various deployment contexts.

  • Data ingest enables data to be accessed by the big data analytics system when performing analysis tasks. In most cases data is analyzed directly at the source location; however, for certain scenarios the big data analytics system on SaaS may require ingesting data into the system.
  • Spatial joins and relationships enable rows from two datasets to be combined based on a spatial relationship. A variety of spatial relationships, including intersect, erase, union, identity, and symmetrical difference, may be applied, though capabilities vary based on the selected deployment pattern.
  • Time steps and temporal relationships enable analysis using time. Time steps slice input data into intervals on which analysis is performed independently, and are available with the Apache Spark deployment pattern. Temporal relationships are used to join data temporally using the join tools and are supported by both deployment patterns.
  • Pattern analysis identifies spatial and temporal patterns in data. This includes tools such as find hotspots, find similar locations, and various regression-based analysis methods for modeling trends and generating predictions.
  • Proximity analysis looks at the proximity of spatial data to other spatial data. This includes tools such as find point clusters and creating buffers.
  • Summarization analysis aggregates or summarizes data into higher-order data structures. This includes tools such as aggregate points, calculate density, and summarize within.
  • Track analysis works with time-enabled points correlated to moving objects. This includes tools such as reconstruct tracks, snap to network, and tools to analyze journeys and dwell locations.
  • Geocoding is the process of converting text to an address and a location. Geocoding tools in big data analytics systems are designed to work on large volumes of address data. Learn more about geocoding.
  • Network analysis helps solve common network problems, often (but not always) for street networks. The capabilities available for network analysis on a big data analytics system differ somewhat in scope from those available in traditional analytics systems. Additionally, the network analysis capabilities vary significantly between deployment patterns. Explore deployment patterns in more detail.
  • Raster analysis supports analytic functions and processors working against raster data. The capabilities available for raster analysis on a big data analytics system are relatively limited compared to traditional analytics systems, and they vary significantly between deployment patterns. Explore deployment patterns in more detail. For more advanced raster and imagery analysis, see the imagery data management system pattern.
  • Data management supports operating on geometries and other fields in big data. This includes tools such as calculate field. The Apache Spark deployment pattern also includes many spatial SQL functions that extend the Spark SQL API.
  • Custom analysis tools are possible with a big data analytics system on Apache Spark, specifically through using the Big Data Toolkit (BDT) option. See the Apache Spark deployment pattern for more detail.
  • Mapping and visualization of analysis results is a powerful step that provides context and helps uncover patterns, trends, and relationships. Visualizing and mapping is analogous to charting and plotting with non-spatial data. It’s a way to verify your analysis, iterate, and create shareable and engaging results. The interfaces for mapping and visualizing analysis results vary depending on the selected deployment pattern; see applications for more information.
  • Data publishing and hosting of analysis results is supported by ArcGIS but is considered outside of the scope of the big data analytics system pattern. See related system patterns for more information.
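
To make the spatial join, time step, and summarization capabilities above more concrete, the following is a minimal single-machine sketch in plain Python. It is not the ArcGIS or Apache Spark API: the function names (point_in_polygon, time_step_start, summarize_within), the zone names, and the sample points are all hypothetical, and a simple ray-casting test stands in for the distributed spatial join a real deployment would perform. It joins points to non-overlapping polygons, slices them into one-hour time steps, and counts points per zone and step.

```python
from collections import Counter
from datetime import datetime, timedelta


def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) falls inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def time_step_start(ts, origin, step):
    """Return the start of the time step (of length `step`) that contains ts."""
    return origin + ((ts - origin) // step) * step


def summarize_within(points, zones, origin, step):
    """Spatially join points to zones, then count points per (zone, time step)."""
    counts = Counter()
    for x, y, ts in points:
        for zone_id, ring in zones.items():
            if point_in_polygon(x, y, ring):
                counts[(zone_id, time_step_start(ts, origin, step))] += 1
                break  # assumption: the hypothetical zones do not overlap
    return counts


# Hypothetical zones (planar coordinates) and time-enabled points.
zones = {
    "west": [(0, 0), (5, 0), (5, 5), (0, 5)],
    "east": [(5, 0), (10, 0), (10, 5), (5, 5)],
}
origin = datetime(2024, 1, 1)
points = [
    (1.0, 1.0, datetime(2024, 1, 1, 0, 10)),
    (2.0, 3.0, datetime(2024, 1, 1, 0, 50)),
    (7.5, 2.0, datetime(2024, 1, 1, 1, 5)),
    (4.0, 4.0, datetime(2024, 1, 1, 1, 30)),
    (12.0, 1.0, datetime(2024, 1, 1, 1, 45)),  # falls outside every zone
]

counts = summarize_within(points, zones, origin, timedelta(hours=1))
for (zone_id, step_start), n in sorted(counts.items()):
    print(zone_id, step_start.isoformat(), n)
```

In a distributed setting the same logic would typically run per partition with a spatial index in place of the nested loop; the sketch only shows how the join, the time slicing, and the aggregation relate to one another.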

Architecture considerations

This section describes in more detail how big data analytics systems align with and focus on specific aspects of the ArcGIS architecture.

For more detailed architecture considerations, see selecting a deployment pattern.

Data (persistence)

Big data analytics systems data architecture considerations

Big data analytics systems work with a wide variety of data stores, including file and object stores (often as distributed data lake stores), relational databases, cloud data warehouses, and NoSQL document stores. The ArcGIS data models and rules may also be employed when working with certain data stores; however, this system type does not usually make use of industry-specific ArcGIS data models. In most cases big data analytics systems work with the data in place, bringing the analytics close to the data; however, the SaaS deployment pattern may require data to be ingested into the Esri-hosted SaaS system. Learn more about how each deployment pattern works with data, and what data stores and sources it supports.

Services (logic)

Big data analytics systems services architecture considerations

Big data analytics systems make use of a narrow but deep set of ArcGIS services, specifically big data analytics, as well as AI and deep learning. The big data analytics system is most commonly used in support of AI and deep learning analysis for engineering data as well as training and testing deep learning models. Learn more about spatial analytics and data science.

The big data analytics system can also be used for querying, accessing, spatial referencing, enriching, and managing big data. Using this system for extract, transform, and load (ETL) workflows is possible, and relatively common. The big data analytics system makes use of interactive mapping with basemaps and reference layers for visualizing analysis results. Cataloging and sharing analysis results and other content through portal services is common, though this is typically accomplished through another ArcGIS-based system. See related system patterns for more information.

Applications (presentation)

Big data analytics systems application architecture considerations

Big data analytics systems typically expose only lower-level user interfaces familiar to data analysts, data scientists, and data engineers. These user interfaces, or applications, vary depending on the selected deployment pattern. See applications for more information.

Support

Big data analytics systems rely on distributed computing, with heavy emphasis on elasticity and scalability. For this reason, most big data analytics systems are cloud-based. Additional support considerations often include infrastructure efficiency and cost management, observability of long-running analytic processes, and integration with data sources and other analytic or engagement systems. For more information on systems integration, see the integration pillar of the Well-Architected Framework. These systems tend not to be subject to performance or reliability SLAs.

For general support and architecture considerations, see architecture practices as well as the architecture pillars of the ArcGIS Well-Architected Framework.

Related system patterns

Big data analytics systems may be integrated or combined with other ArcGIS system patterns. Some common examples include:

For more information on integrating or composing system patterns, see using system patterns.

Examples

Industry-specific system examples for this system pattern include:

  • Commercial. Organizations in commercial real estate, financial services, and retail sectors can utilize a big data analytics system pattern to speed up large-scale demographic analysis tasks. This could potentially include enriching data with all of Esri’s demographic variables, rather than only a few. Tasks like this can be run more quickly and frequently with this pattern, so organizations can get comprehensive, up-to-date demographic insights to inform their decisions.
  • Health and human services. The risk of diseases and other health issues can vary greatly by location. Researchers in health care and public health organizations can utilize a big data analytics system pattern to efficiently investigate correlated factors that influence health and disease transmission risk in their communities. Health organizations can also utilize a big data analytics system pattern to evaluate network adequacy.
  • Insurance. Insurers use spatial data to help manage risk and appropriately price their insurance policies. They can utilize a big data analytics system pattern to assess spatial relationships between hazards and policies, helping them balance exposure to risk. They are also interested in geoenabling vehicle telemetry data they collect using OBD2 devices, so they can gain insights into driver behavior. For example, they can identify safe drivers that select the safest possible routes and comply with posted speed limits, then reward these drivers with lower insurance premiums.
  • National government. National agencies often collect extremely large troves of data about social, economic, and environmental activity. Using a big data analytics system pattern, they can analyze this data to quickly investigate and understand time-critical patterns and activities of interest. For example, they can identify dwell locations (places where people spend time), spatial clusters (places where people gather), and anomalies (like unexpected changes and activity).
  • Natural resources. With a big data analytics system pattern, oil and gas companies can apply data they create for their digital twins to create what-if scenarios, identify anomalies (like broken assets), and model relationships using their Spark big data infrastructure. These companies can also use historical GPS tracks to detect lease roads (which are not part of a public road network), then connect them with public roads. Users can apply that road data to optimally sequence inspection sites, reducing the amount of time their employees need to spend on the road during inspections (also known as windshield time).
  • State and local government. State and local agencies rely on data to help them provide effective services to citizens. With a big data analytics system pattern, they can understand historical data related to their services, such as 311 call histories, vehicle telemetry data, and more. This lets them answer questions about their level of responsiveness to citizen complaints and assess the performance of service providers.
  • Telecommunications. With a big data analytics system pattern, telcos can analyze call records to identify problems and anomalies in the network, such as a statistically significant hotspot with a high accumulation of dropped calls. They can also fuse demographic data with data from Wi-Fi access hotspots to extract inferences about caller characteristics and behavior. They may also be interested in selling this behavioral data to external customers, like social media companies.
  • Transportation. Connected vehicles (like cars and trains) collect telemetry data to help improve the operation of the vehicle. With a big data analytics system pattern, vehicle manufacturers (and developers of onboard systems) can run analytics against historical telemetry to gain insight into real-world operating conditions. They can then use these insights to improve travel time estimates, road and navigation data, and other services related to vehicles and fleets. Some organizations may also be interested in selling their telemetry data and analytic insights to third parties.
  • Utilities. Utilities can use a big data analytics system pattern to review historical usage and outage information, then correlate that data with weather patterns and other local conditions to understand which factors drive higher usage and increase outage risk. This helps them improve usage forecasting, prioritize preventative maintenance, and predict customer service needs.
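
Several of the examples above (vehicle telemetry, historical GPS tracks, dwell locations) rest on reconstructing tracks from time-enabled points. The following is a minimal sketch of that idea in plain Python, not the ArcGIS track analysis tools themselves. The function names, the track identifiers, the planar coordinates in meters, and the distance and duration thresholds are all assumptions chosen for illustration. Observations are grouped by identifier and ordered by time, and a dwell is reported when an object stays within a given distance of where it stopped for at least a minimum duration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from math import hypot


def reconstruct_tracks(observations):
    """Group (track_id, timestamp, x, y) rows by id and order each group by time."""
    tracks = defaultdict(list)
    for track_id, ts, x, y in observations:
        tracks[track_id].append((ts, x, y))
    return {track_id: sorted(rows) for track_id, rows in tracks.items()}


def find_dwells(track, max_distance, min_duration):
    """Return (start, end) time spans in which the object stayed within
    max_distance of the first point of the span for at least min_duration."""
    dwells = []
    start = 0
    for i in range(1, len(track) + 1):
        # A span ends when the track ends or the object moves too far away.
        ended = i == len(track) or hypot(
            track[i][1] - track[start][1], track[i][2] - track[start][2]
        ) > max_distance
        if ended:
            if track[i - 1][0] - track[start][0] >= min_duration:
                dwells.append((track[start][0], track[i - 1][0]))
            start = i
    return dwells


# Hypothetical, unordered telemetry rows: (track_id, timestamp, x, y) in meters.
t0 = datetime(2024, 1, 1, 8, 0)
observations = [
    ("truck-1", t0 + timedelta(minutes=20), 3, 4),
    ("truck-1", t0, 0, 0),
    ("truck-2", t0, 50, 50),
    ("truck-1", t0 + timedelta(minutes=10), 2, 1),
    ("truck-1", t0 + timedelta(minutes=35), 900, 900),
    ("truck-1", t0 + timedelta(minutes=30), 500, 500),
]

tracks = reconstruct_tracks(observations)
dwells = find_dwells(tracks["truck-1"], max_distance=10, min_duration=timedelta(minutes=15))
for start, end in dwells:
    print("truck-1 dwelled from", start.isoformat(), "to", end.isoformat())
```

Real telemetry would use geographic coordinates and geodesic distances, and the grouping step would be distributed across a cluster; the sketch keeps everything planar and in memory so the logic stays visible.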