Data lakes and data warehouses

Many organizations have established strategies around data consolidation or transformation over the last several years or are actively engaged in such an activity today. These strategies are often driven by intersecting priorities, including the need to break down siloed data systems, enable integrated, rapid cross-dataset analysis, and work effectively at scale as data volume and complexity grow.

Two categories of technology are often implemented, separately or in combination, to support this transformation: data lake technology and data warehouse technology. The two are often conflated and may be used interchangeably in informal discussions, but at both the conceptual and implementation levels they serve different purposes and play different roles in an enterprise information system. This page describes these technologies at a general, vendor- and product-agnostic level, and explains Esri’s strategic approach to working with them from an ArcGIS system perspective.

Data lakes

Data lakes can vary greatly in definition and construction, but as a general concept, a data lake is a centralized repository for individual data items (files) of various types, stored in one large, scalable storage system that allows for indexing, cataloging, and running analytics across those files. Data can be structured or unstructured, stored in various file formats, and organized according to the inbound data streams or other systems that create it. Data is then accessed, visualized, and analyzed using tools and capabilities that the data lake provider makes available through software.

For example, an organization might use a data lake to store thousands of CSV files, each representing an hourly track of vehicle location, speed, and direction. Data lake software might support running batch analytics across these thousands of files to extract events where speed exceeded a certain threshold, or to detect trends in locations where frequent stops occur. Traditional data systems would require merging these many files into one dataset before it could be accessed, but a data lake allows this query and analytics experience to operate at scale across the entire pool of data. In the same data lake, the organization might also store customer profile and sales data in a file-based archive format, so that previous activity can be correlated with the live, transactional database of active orders or customers.

Working with data lakes in ArcGIS

Data lakes are primarily accessed from ArcGIS as a source of data for asking analytical questions, with the results then viewed in a spatial or map-based interface. As these storage systems usually contain very large collections of data in similarly structured files, the processes that work with this data are often summarizing the contents, running a geospatial analytic, or comparing these datasets to other spatial or non-spatial layers to answer a specific analytical question. Generally, these analyses are processing-intensive, so the workflow usually involves:

  1. Design or define the analytical question
  2. Run the analytics using an interface provided by the data lake or an external processing system like Apache Spark
  3. View the summarized results of the analytics in tabular form or as a spatial layer
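
The three steps above can be sketched in plain Python. This is a minimal, local illustration only: the file layout and column names (vehicle_id, speed, lat, lon) are assumptions, and a real data lake would run the scan at scale with an engine such as Apache Spark rather than a single-machine loop.

```python
import csv
import io

def speeding_events(files, threshold_mph=70):
    """Step 1: the analytical question is 'where did speed exceed a threshold?'
    Step 2: run the scan across every file in the collection.
    Step 3: return summarized result rows for tabular or map display."""
    results = []
    for name, handle in files:
        for row in csv.DictReader(handle):
            if float(row["speed"]) > threshold_mph:
                results.append((name, row["vehicle_id"], float(row["speed"])))
    return results

# Two tiny in-memory "files" standing in for thousands of objects in the lake.
hour1 = io.StringIO("vehicle_id,speed,lat,lon\nv1,72.5,34.0,-117.0\nv2,55.0,34.1,-117.1\n")
hour2 = io.StringIO("vehicle_id,speed,lat,lon\nv1,68.0,34.2,-117.2\nv3,81.3,34.3,-117.3\n")

events = speeding_events([("hour1.csv", hour1), ("hour2.csv", hour2)])
```

Only the small summarized result set travels back to the client; the bulk of the data stays in the lake.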

Data lakes can also be used as the source of raster files that are accessed through a cloud storage connection file from ArcGIS Pro or as part of a mosaic dataset. These raster files can be added to a mosaic dataset and used to publish an image service, serve as a source for raster analytics, be viewed in ArcGIS Pro, or support geoprocessing, analysis, and rendering workflows.

Some example data lake workflows using ArcGIS might include:

  • Run a hot spot analysis across thousands of CSV files that contain bird locations collected over a multi-year period using ArcGIS GeoAnalytics Engine or ArcGIS GeoAnalytics in ArcGIS Pro.
  • As an environmental scientist, use the ArcGIS API for Python to identify times and locations of high ozone levels across the country in a dataset of millions of static sensor reads, using tools like Detect Incidents and Create Space Time Cube.
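
The core logic of the ozone example can be sketched in plain Python. This is an illustration of the incident-detection idea only, not the Detect Incidents tool itself; the sensor IDs, timestamps, and 0.070 ppm threshold are illustrative assumptions.

```python
def detect_incidents(reads, threshold=0.070):
    """Flag each read as starting, continuing, or outside an incident,
    where an incident is a run of consecutive reads above the threshold.
    This sketch tracks a single sensor; a multi-sensor version would
    keep per-sensor state."""
    flagged = []
    in_incident = False
    for sensor_id, timestamp, ozone in reads:
        if ozone > threshold:
            status = "IncidentStart" if not in_incident else "Ongoing"
            in_incident = True
        else:
            status = "None"
            in_incident = False
        flagged.append((sensor_id, timestamp, ozone, status))
    return flagged

reads = [
    ("s1", "2023-07-01T10:00", 0.052),
    ("s1", "2023-07-01T11:00", 0.081),  # crosses threshold: incident starts
    ("s1", "2023-07-01T12:00", 0.079),  # still above: ongoing
    ("s1", "2023-07-01T13:00", 0.041),  # back below: incident ends
]
flags = [row[3] for row in detect_incidents(reads)]
```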

Data warehouses

Data warehouses are another type of storage system that can vary in design, definition, and construction. As a general concept, a data warehouse is most similar to a relational database management system in that it allows for storage of large, structured datasets with cross-dataset query, analytic, and summarization functionality. A data warehouse generally differs from a traditional relational database system in the scale of data it can support, the type and variety of analytics that can be performed, and the speed with which those processes can be completed.

Data warehouses are also often built in a more cloud-native configuration or provided in a software-as-a-service model, where clients connect to a system managed by the company that builds the data warehouse technology, using compute capacity and storage that are also hosted and managed by that provider. Another common component of data warehousing is the use of denormalized data models, such as a star schema or other dimensional models.
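
A star schema can be illustrated with a tiny example: one fact table of transactions joined to a descriptive dimension table. SQLite stands in for the warehouse here, and the table and column names are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Dimension table: descriptive attributes, one row per product.
conn.execute("CREATE TABLE dim_product (product_id INTEGER, category TEXT)")
# Fact table: one row per transaction, keyed to the dimension.
conn.execute("CREATE TABLE fact_sales (product_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "grocery"), (2, "apparel")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                 [(1, 20.0), (1, 40.0), (2, 90.0)])

# A typical warehouse query: aggregate the facts, described by the dimension.
totals = conn.execute(
    """SELECT d.category, SUM(f.amount)
       FROM fact_sales f JOIN dim_product d ON f.product_id = d.product_id
       GROUP BY d.category ORDER BY d.category"""
).fetchall()
```

The fact table can grow to billions of rows while the dimension tables stay small, which is what makes this layout efficient for summary queries.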

Working with data warehouses in ArcGIS

Working with data warehouses from ArcGIS can take several different forms. The most common pattern starts with a connection from ArcGIS Pro to the data warehouse to run a query against a table, view, or dataset. This is created through a query layer, a layer type in ArcGIS Pro that can run user-defined SQL against a supported database (common examples include SQL Server-based and PostgreSQL-based data warehouses, Snowflake, Google BigQuery, and Amazon Redshift). For more information, see the ArcGIS Pro documentation.

The query layer returns results from the database as a table, which can be displayed on the map if it contains a spatial column recognized by ArcGIS, and can then be used for visualization, as an input to an analysis, or as an input for cartographic map creation. This layer is a live connection to the warehouse, so a new query is sent on each change of map extent, returning a new set of rows that may reflect updated source data, an updated calculation, or simply a new spatial extent.
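
The extent-driven behavior can be sketched as follows. Each pan or zoom issues a fresh, bounded SQL query; SQLite stands in for the data warehouse, and the table schema and coordinate columns are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stores (name TEXT, x REAL, y REAL)")
conn.executemany(
    "INSERT INTO stores VALUES (?, ?, ?)",
    [("A", -117.2, 34.1), ("B", -116.5, 33.9), ("C", -118.4, 34.0)],
)

def query_extent(conn, xmin, ymin, xmax, ymax):
    """Simulate the query a layer sends when the map extent changes:
    only rows inside the current bounding box come back."""
    cur = conn.execute(
        "SELECT name FROM stores WHERE x BETWEEN ? AND ? AND y BETWEEN ? AND ? "
        "ORDER BY name",
        (xmin, xmax, ymin, ymax),
    )
    return [row[0] for row in cur]

# Panning the map re-runs the query with new bounds; store C falls outside.
visible = query_extent(conn, -118.0, 33.5, -116.0, 34.5)
```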

If web-based access is required, the query layer can be published as a dynamic map service to an ArcGIS GIS Server site. Any symbology or definition from the ArcGIS Pro map is carried over into the service configuration, but each user request now triggers a new SQL query from ArcGIS Server to the data warehouse.

Data warehouses are optimized for large, analytical, or summary queries, allowing data owners to answer questions like “what was the average purchase size in different product categories across several million transactions at our stores in the past 24 hours?” A query like this generally runs on a regular basis, with the results used to power a dashboard, data summary, or chart until they are refreshed on the next scheduled run. Data warehouses can also be used for more iterative, exploratory analysis by a data analyst or data scientist, usually to define and then re-use a summary statistic or report.

For this reason, query layers in ArcGIS that connect to data warehouses should most often query for the results of such an analysis, rather than a specific set of transactional rows (such as a list of all million transactions in the 24-hour period). While data warehouses can functionally be queried row by row, they are not optimized for this kind of transactional interaction, and querying them this way can lead to frustrating experiences, such as requesting a million rows to display on a map and waiting five minutes for all of the rows to return and render.
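
The contrast between the two query styles can be sketched with a toy example, using SQLite in place of the warehouse and an assumed sales schema. The summary query returns a handful of rows no matter how many transactions exist; the transactional query returns every row and is the pattern to avoid at warehouse scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("grocery", 20.0), ("grocery", 40.0), ("apparel", 90.0), ("apparel", 30.0)],
)

# Summary query: one row per category, regardless of transaction volume.
summary = conn.execute(
    "SELECT category, AVG(amount) FROM sales GROUP BY category ORDER BY category"
).fetchall()

# Transactional query: every row comes back -- fine for four rows,
# painful for millions.
all_rows = conn.execute("SELECT category, amount FROM sales").fetchall()
```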

Once a data warehouse-backed query layer is published to ArcGIS Enterprise, ArcGIS can apply additional performance optimizations for certain providers. With cloud data warehouses, ArcGIS has introduced additional logic to assist with the efficient processing of queries and usage of compute resources, specifically for Amazon Redshift, Google BigQuery, and Snowflake, which can impose a cost based on total compute consumption. During publishing, ArcGIS can optionally create a materialized view in the source data warehouse, which can improve performance and reduce the overall cost of queries. Another option is a “snapshot” mode, in which a copy of the data is moved to the ArcGIS Data Store during publishing and then kept up to date on a schedule set by the publisher, so that the GIS system always has a recent set of results while relatively slow queries are not continuously sent to the source data warehouse.
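
The snapshot pattern can be sketched as follows: the expensive warehouse query runs on a schedule, and its results are copied into a local table that serves all map requests in between. Two SQLite connections stand in for the warehouse and the ArcGIS Data Store; names and schema are assumptions.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (category TEXT, amount REAL)")
warehouse.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("grocery", 20.0), ("grocery", 40.0), ("apparel", 90.0)],
)

local = sqlite3.connect(":memory:")  # stands in for the ArcGIS Data Store
local.execute("CREATE TABLE sales_snapshot (category TEXT, avg_amount REAL)")

def refresh_snapshot():
    """Run the slow summary query once and replace the cached copy,
    so map requests read the local snapshot instead of the warehouse."""
    rows = warehouse.execute(
        "SELECT category, AVG(amount) FROM sales GROUP BY category ORDER BY category"
    ).fetchall()
    local.execute("DELETE FROM sales_snapshot")
    local.executemany("INSERT INTO sales_snapshot VALUES (?, ?)", rows)

refresh_snapshot()  # in production this runs on the publisher's schedule
cached = local.execute("SELECT * FROM sales_snapshot ORDER BY category").fetchall()
```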

Other methods for integration with data warehouses also exist, but are less common, including:

  • ETL-style integrations, where the results of a data warehouse query or view are copied to another data format in the enterprise GIS system on a regular basis
  • API-style integrations, where an ArcGIS web application or client application can query a data warehouse HTTP endpoint to return results or values