Data lakes and data warehouses

Many organizations have established data consolidation or transformation strategies over the last several years, or are actively engaged in such an effort today. These strategies are often driven by intersecting priorities: the need to break down siloed data systems, enable rapid, integrated cross-dataset analysis, and work effectively at scale as data volume and complexity grow.

Two categories of technology are often implemented, separately or in combination, to support this transformation: data lakes and data warehouses. The two are often conflated and may be used interchangeably in informal discussions, but at both conceptual and implementation levels they serve different purposes and play different roles in an enterprise information system. This page describes these technologies at a general, vendor- and product-agnostic level, and explains Esri’s strategic approach to working with them from an ArcGIS system perspective.

Data lakes

Data lakes can vary greatly in definition and construction, but as a general concept, a data lake is a centralized repository for individual data items (files) of various types, stored in one large, scalable storage system that supports indexing, cataloging, and running analytics across those files. Data can be structured or unstructured, stored in various file formats, and organized according to the inbound data streams or other systems that create it. The data is then accessed, visualized, and analyzed using tools and capabilities that the data lake provider makes available through software.

For example, an organization might use a data lake to store thousands of CSV files that represent individual hourly tracks of vehicle location, speed, and direction. Data lake software might support running batch analytics across these thousands of files to extract scenarios where speed exceeded a certain amount or to detect trends in locations where frequent stops occur. Traditional data systems would require merging these many files into one dataset before it could be accessed, but a data lake allows this query and analytics experience to operate at scale across the massive pool of data. In the same data lake, the same organization might store customer profile and sales summaries in a file-based archive format, so that previous activity can be correlated with the live, transactional database of active orders or customers.
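The vehicle-track scenario above can be sketched in a few lines. This is an illustrative, standard-library-only sketch: the column names (timestamp, speed_mph) and the inline "files" are invented stand-ins for objects in lake storage, not an actual data lake API.

```python
# Hypothetical sketch: scan many per-hour vehicle track CSVs for speed
# exceedances. Column names are assumed; a real data lake would expose
# these files through object storage rather than in-memory strings.
import csv
import io

def speeding_events(csv_text, threshold=65.0):
    """Return rows from one track file where speed exceeds the threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if float(row["speed_mph"]) > threshold]

# Two stand-in "files" from the lake; a real run would iterate thousands.
files = [
    "timestamp,lat,lon,speed_mph\n"
    "2024-05-01T08:00Z,34.05,-118.24,62.1\n"
    "2024-05-01T08:01Z,34.06,-118.25,71.4\n",
    "timestamp,lat,lon,speed_mph\n"
    "2024-05-01T09:00Z,34.07,-118.26,58.9\n",
]
events = [e for f in files for e in speeding_events(f)]
print(len(events))  # → 1
```

A data lake engine applies the same per-file logic, but distributes it across the full collection instead of looping in a single process.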

Working with data lakes in ArcGIS

Data lakes are primarily accessed from ArcGIS as a source of data for asking analytical questions, with the results then viewed in a spatial or map-based interface. As these storage systems usually contain very large collections of data in similarly structured files, the processes that work with this data often summarize the contents before running a geospatial analytic or comparing these datasets to other spatial or non-spatial layers to answer a specific analytical question. Generally, these analyses are processing-intensive, so the workflow usually involves several steps:

  1. Carefully design or define the analytical question, often using a subset of the data or a single example file or dataset. Optimize this analysis as much as possible if it will be run against a very large data layer, and reduce the output columns and data as much as possible to increase efficiency.
  2. Run the analytics using an interface provided by the data lake or an external processing system such as Apache Spark. Results are often returned to an in-memory data frame or dataset within that interface.
  3. View the summarized results of the analytics in tabular form or as a spatial layer to review the answers to the analytical question. Optionally, persist the results out to another format or repository, by publishing a CSV file, creating an ArcGIS hosted feature layer, or pushing results to another API or system.
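The three steps above can be sketched as follows. This is a hedged, standard-library-only illustration of the workflow shape, not the GeoAnalytics or Spark API; the file contents and column name are invented, and the per-file analytic is deliberately minimal.

```python
# Sketch of the design → run → review workflow, using only the standard
# library; a production run would use Spark or ArcGIS GeoAnalytics Engine
# against the lake itself.
import csv
import io
import statistics

def summarize(csv_text):
    """Step 1: the per-file analytic, designed and tuned on a single sample
    file. Output columns are kept minimal — here, one mean speed per file."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return statistics.fmean(float(r["speed_mph"]) for r in rows)

# Step 2: run the analytic across the whole collection; results land in an
# in-memory structure (here a dict, analogous to a data frame).
files = {
    "track_0800.csv": "timestamp,speed_mph\n08:00,41.0\n08:01,43.0\n",
    "track_0900.csv": "timestamp,speed_mph\n09:00,30.0\n",
}
results = {name: summarize(text) for name, text in files.items()}

# Step 3: review the summarized results in tabular form (or persist them
# to a CSV, hosted feature layer, or other system).
for name, mean_speed in sorted(results.items()):
    print(f"{name}: {mean_speed:.1f} mph")
```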

Data lakes can also be used as a repository of imagery datasets or raster files, accessed through a cloud connection file from ArcGIS Pro or as part of a mosaic dataset. Data lake imagery can be added to a mosaic dataset and used to publish an image service, serve as a source for raster analytics, be viewed in ArcGIS Pro, and support geoprocessing, analysis, or rendering workflows.

Some example data lake workflows using ArcGIS might include:

  • Completing a hot spot analysis across thousands of CSV files containing bird locations collected over a multi-year period, using ArcGIS GeoAnalytics Engine or ArcGIS GeoAnalytics in ArcGIS Pro.
  • As an environmental scientist, using the ArcGIS API for Python to identify times and locations of high ozone levels across the country in a dataset of millions of static sensor reads, with tools like Detect Incidents and Create Space Time Cube.

Data warehouses

Data warehouses are another type of storage system that can vary in design, definition, and construction. As a general concept, a data warehouse is most similar to a relational database management system: it allows for storage of large, structured datasets and provides cross-dataset query, analytic, and summarization functionality. A data warehouse generally differs from a traditional relational database system in the scale of data it can support, the type and variety of analytics that can be performed, and the speed with which these processes can be completed.

Data warehouses are also often built in a more cloud-native configuration or provided in a software-as-a-service model, where clients connect to a system managed by the company that builds the data warehouse technology, using compute capacity and storage that are also hosted and managed by that provider. Another common component of data warehousing is the use of denormalized, analytics-oriented data models, such as a star schema or dimensional model.

Working with data warehouses in ArcGIS

Working with data warehouses from ArcGIS can take several different forms. The most common pattern starts with a connection from ArcGIS Pro to the data warehouse to run a query against a table or view. This is created through a query layer, which is a layer type in ArcGIS Pro that can run user-defined SQL against a supported database. For a list of supported databases and cloud data warehouses, see the ArcGIS Pro documentation.

The query layer returns results from the database as a table, which can be displayed on the map if it contains a spatial column that is recognized by ArcGIS, and can then be used for visualization, as an input to an analysis, or as an input for cartographic map creation. This layer is a live connection to the warehouse, so a new query is sent on each change of map extent, which returns a new set of rows, potentially reflecting updated source data, an updated calculation, or simply a new spatial extent.

If web-based access is required, this query layer can be published to ArcGIS Enterprise as a dynamic map service. This service will carry over any symbology or definition from the ArcGIS Pro map into the service configuration. Each user request will trigger an updated SQL query from ArcGIS Enterprise to the data warehouse.

Data warehouses are optimized for large, analytical, or summary queries, allowing data owners to answer questions like “what was the average purchase size in different product categories across several million transactions at our stores in the past 24 hours?” A query like this generally runs on a regular basis, with the results used to power a dashboard, data summary or chart until they are updated on a later, regular schedule. Data warehouses can also be used for more iterative, exploratory analysis by a data analyst or data scientist, usually to define and then re-use a summary statistic or report.
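The "average purchase size" question above can be illustrated with a small aggregate query. This runs against an in-memory SQLite database purely for demonstration; the table and column names are invented, and a real warehouse would execute the same shape of SQL over millions of rows.

```python
# Illustrative only: a warehouse-style summary query, shown here against an
# in-memory SQLite database. Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE transactions (category TEXT, amount REAL, ts TEXT)")
con.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("grocery", 42.50, "2024-05-01T10:00"),
        ("grocery", 17.50, "2024-05-01T11:00"),
        ("electronics", 300.00, "2024-05-01T12:00"),
    ],
)

# The kind of aggregate a warehouse is built for: a heavy scan of many
# rows, returning only a few summary rows.
rows = con.execute(
    """SELECT category, AVG(amount) AS avg_purchase
       FROM transactions
       WHERE ts >= '2024-05-01T00:00'
       GROUP BY category
       ORDER BY category"""
).fetchall()
print(rows)  # → [('electronics', 300.0), ('grocery', 30.0)]
```

A query layer pointed at a view like this returns a handful of summary rows, which is exactly the interaction pattern warehouses are tuned for.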

For this reason, query layers in ArcGIS that connect to data warehouses should most often query for the results of such an analysis, rather than a specific set of transactional rows (such as a list of all million transactions in the 24-hour period). While data warehouses can functionally be queried row by row, they are not optimized for this kind of transactional interaction, and attempting it can lead to frustrating experiences, such as querying a million rows to display on a map and waiting through a lengthy response time while all the requested rows are returned (and drawn).

Query layers can be further optimized by reducing the number of queries that are made directly against the data. Executing fewer queries can reduce your cloud data warehouse costs because these services generally incur costs based on the compute resources used. When publishing layers from data in a cloud data warehouse, ArcGIS can optionally create a materialized view. The materialized view can be refreshed after the underlying data are updated to incorporate those updates into the view. Using materialized views means you only incur cloud data warehouse compute costs when you refresh the view, not every time you access the data. Using pre-computed materialized views is also typically faster than executing a query against the data directly. Materialized views are particularly useful for datasets that are updated infrequently and for complex queries that require significant computational resources.
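The refresh pattern behind a materialized view can be sketched as follows. SQLite has no materialized views, so this sketch emulates the concept by persisting a query result into a summary table; all names are invented, and a real warehouse would provide CREATE MATERIALIZED VIEW and a refresh command natively.

```python
# Hedged sketch of the materialized-view refresh pattern: persist the query
# result once, serve reads from the copy, and re-run the expensive query
# only on refresh. SQLite stands in for the warehouse; names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("west", 10.0), ("west", 20.0), ("east", 5.0)])

SUMMARY_SQL = "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"

def refresh_view(con):
    # Recompute the summary and persist it; this is the only step that
    # scans (and bills for) the full source table.
    con.execute("DROP TABLE IF EXISTS sales_summary")
    con.execute(f"CREATE TABLE sales_summary AS {SUMMARY_SQL}")

refresh_view(con)
# Readers hit the small precomputed table, not the source data.
print(con.execute("SELECT * FROM sales_summary ORDER BY region").fetchall())
# → [('east', 5.0), ('west', 30.0)]

# Changes to the underlying data are invisible until the next refresh.
con.execute("INSERT INTO sales VALUES ('east', 95.0)")
refresh_view(con)
print(con.execute(
    "SELECT total FROM sales_summary WHERE region='east'").fetchone())
# → (100.0,)
```

This is also why materialized views suit infrequently updated data: each refresh pays the full query cost, while every read between refreshes is cheap.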

Another strategy for reducing the number of queries made to the cloud data warehouse is to create a snapshot of the data when publishing a query layer. A snapshot copies the query result set from the cloud data warehouse to ArcGIS Enterprise. The layer references this copy instead of querying the cloud data warehouse, which generally improves the performance of the layer. Snapshots are not updated automatically when the underlying data are updated. In ArcGIS Enterprise, you can refresh the snapshot on demand or schedule a refresh at specific intervals.

Other methods for integration with data warehouses also exist, but are less common, including:

  • ETL-style integrations, where the results of a data warehouse query or view are copied into ArcGIS on a regular basis. These integrations might perform this work using ArcGIS Data Pipelines, the ArcGIS Python libraries, ArcGIS Notebooks, or the ArcGIS Data Interoperability extension. An advantage of ETL-style integrations is that they allow you to bring data from data warehouses into ArcGIS Online as a hosted feature layer or table.
  • API-style integrations, where an ArcGIS web application or client application can query a data warehouse to return results or values. This may be possible with ArcGIS Maps SDKs, where a client-side feature layer or graphics layer could be created from the results. The ArcGIS Enterprise SDK could also be used to create a custom data feed for feature services. In both of these scenarios, authentication, data update frequency, and access controls are relevant requirements to consider.