Data pipelines and ETLs

Most organizations store important data across multiple systems. Data loading and integration tools, such as pipelines and extract, transform, and load (ETL) tools, are critical for moving data between these systems so it can be combined and analyzed with other sources, or used to keep related data up to date.

Many different technical solutions can be used for moving data, from manual solutions where data is copied on demand to automated systems that perform complicated data transformations.

Use of the term “integration” in relation to data pipelines or ETLs can be confusing, because integration usually implies closer, more real-time access to data. ETLs are an appropriate tool for moving data between systems when changes happen infrequently, or when access to the data does not depend on having the absolute latest copy, such as when a nightly sync is sufficient. At this level, they fulfill many integration requirements even though the connection is asynchronous in nature.

ArcGIS Data Pipelines

ArcGIS Data Pipelines is an ArcGIS Online capability that provides built-in integration support for connecting your data with ArcGIS. With Data Pipelines, you can connect to and read data from where it is stored, perform data preparation operations, and write the data out to a hosted feature layer that is readily available in ArcGIS. You can use the no-code Data Pipelines interface to construct, run, schedule and reproduce your data preparation workflows.

Data Pipelines works with vector data (for example, points, lines, and polygons) and tabular data (for example, data represented as a table). You can connect to a variety of data sources including Amazon S3, Google BigQuery, Snowflake, feature layers, uploaded local files, and more. Once connected, you can use tools to blend, build, and integrate datasets for use in your workflows.

The tools available in Data Pipelines are organized into tool sets by capability, such as clean, construct, integrate, and format. For example, the following workflows are supported by Data Pipelines tools:

  • Manipulate dataset schemas by updating or adding fields
  • Select a subset of fields to extract targeted information
  • Find and replace attribute values to clean or simplify the data
  • Combine datasets using join or merge functionality
  • Calculate fields using Arcade functions
  • Create geometry or time fields for use in spatial or temporal analysis
  • Un-nest JSON objects in a column
  • Map fields between schemas

While building a data pipeline and configuring tools, you can preview results at each processing step as a map or table, optionally sorting table columns to inspect values such as minimums, maximums, and nulls. Iteratively inspecting the data and applying data preparation tools helps improve data quality throughout the process.

Once you’ve completed configuration of the data pipeline, you can run it to create or update an ArcGIS feature layer that will be available in your content. You can configure geometry and time properties for the output feature layer so it’s ready for use in additional workflows such as spatial or temporal analysis, dashboards, or web maps.

Data Pipelines can be a suitable choice when:

  • A no-code to low-code solution is attractive.
  • You create or maintain hosted feature layers in ArcGIS Online.
  • The output services contain a single feature layer.
  • The output layers are not the origin of relationships.
  • A SaaS tool managed with your organization’s content is desired.
  • Data quality is discoverable by preview.
  • Data size is moderate (up to several million rows, depending on row width).

For more information on Data Pipelines, see Introduction to Data Pipelines.

ArcGIS Data Interoperability

The ArcGIS Data Interoperability extension is a spatial extract, transform, and load (ETL) tool set for both ArcGIS Pro and ArcGIS Enterprise that runs either through standalone tools or in the geoprocessing framework. Data Interoperability is built from and is compatible with Safe Software’s FME product.

Data Interoperability provides users with a no-code authoring experience using the Workbench desktop app, with work saved to local workbench files, or as Spatial ETL tools in a project toolbox. Workbenches can connect to a wide variety of data sources, perform simple or complex operations on data in transit, and then write data to a wide variety of destinations. This is a fully featured ETL system, where workbenches can be run manually for inspection or testing, automated to run on a regular basis, or invoked as part of a geoprocessing service.

Data Interoperability provides three desktop apps (Workbench, Data Inspector and Quick Translator) that can be accessed from ArcGIS Pro. ETL tool results can also be shared to ArcGIS Server as web tools like any other geoprocessing result. This lets you publish ETLs which can be accessed as web services, for example in a Web AppBuilder geoprocessing widget.

Data Interoperability may be a suitable option when:

  • A no-code solution is desired.
  • You need to read or write otherwise unsupported data types.
  • You need to output complex data models.
  • Geoprocessing integration is desired.
  • Deep inspection of data is needed to uncover quality issues.
  • Deep support for complex data formats (for example, Excel) is needed.

For more information on Data Interoperability, see the ArcGIS Data Interoperability documentation.

Other ETL options

Another common approach to building ETLs is to use Python, usually in combination with arcpy and the arcgis module of the ArcGIS API for Python. The Python developer community, along with many commercial providers, has created hundreds of libraries that assist with connecting to almost any imaginable data source, from database clients to storage types, file parsers, and web client libraries. This means that any data source you can connect to can be pulled into Python, transformed or reshaped, and pushed to an output, which in an ArcGIS-based system is often an editable feature layer, image file, or 3D dataset.

A popular pattern is to use a notebook that takes advantage of industry-standard data access protocols. For example, when integrating with a cloud warehouse, install the warehouse’s PEP 249-compliant database access package so the notebook can use standard SQL and cursor-based processing.
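
As a minimal sketch of this idea, the following assumes the Snowflake connector for Python (any PEP 249-compliant driver follows the same connect/cursor/execute pattern); the account details, table, and query are placeholders rather than part of any specific sample.

```python
# Minimal PEP 249 sketch: query a cloud warehouse from a notebook.
# Assumes snowflake-connector-python is installed (for example with pip);
# the account, credentials, and the table queried are placeholders.
import snowflake.connector  # any PEP 249-compliant driver works the same way

conn = snowflake.connector.connect(
    account="<account_identifier>",
    user="<user>",
    password="<password>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)
try:
    cur = conn.cursor()
    # Standard cursor-based processing as defined by PEP 249
    cur.execute(
        "SELECT name, latitude, longitude FROM places WHERE city = %s",
        ("London",),
    )
    for name, lat, lon in cur.fetchall():
        print(name, lat, lon)
finally:
    conn.close()
```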

An emerging pattern is to write Python-based ETLs that take advantage of remote file access over HTTP using the S3 API that is now common among storage providers. This pattern applies to cloud object stores hosting well-known file types like Apache Parquet, GeoParquet, JSON, or CSV, and is enabled by extending the Python environment. It allows files to be read or written as if they were tables in a SQL-compliant database, including partitioning a single logical table across any number of physical files so that processing can be parallelized across the set. Together, these two approaches let you perform ETL against cloud warehouses and cloud object stores with strong performance and a complete feature set.
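
A minimal sketch of the remote file pattern, assuming DuckDB with its httpfs extension; the file URL and column names are hypothetical.

```python
# Read a single cloud-hosted Parquet file over HTTP as if it were a database
# table. The URL and column names are hypothetical.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")   # adds https:// and s3:// file access
con.execute("LOAD httpfs;")

df = con.execute("""
    SELECT *
    FROM read_parquet('https://example.com/data/observations.parquet')
    WHERE observed_date >= DATE '2024-01-01'
""").fetch_df()                  # result as a pandas DataFrame
```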

A Python-based ETL may be suitable when:

  • Your organization has the Python skills and bandwidth to maintain a codebase.
  • You are looking to maximize and carefully manage ETL performance.
  • You need a special, otherwise unsupported, capability or connectivity to a specific data source or provider with a publicly available Python module.

Many other ETL offerings are available from cloud providers, independent software vendors, and through open-source tools. Each of these offerings will support different inputs, processors, and outputs, so it is important to consider the data movement requirements for your system before choosing an offering. Both ArcGIS Data Pipelines and ArcGIS Data Interoperability offer a no-code experience and include pre-built connections to write data into ArcGIS, which can be an important efficiency boost when creating a system that implements the no-code pattern.

ETL and Data Migration Strategies

This section provides example strategies for data migration workflows, using several different technologies, with useful tips and guidelines for each. These samples are intended as starting points for building your organization’s data migration workflows; do not use them as-is or without review.

Local well-known files

This pattern relies on ArcGIS Pro only.

One of the most common shared file formats is CSV (comma-separated values), also known as ASCII or text files, followed closely by Microsoft Excel workbooks. CSV and Excel files are popular because they are easy to share, read (by people as well as software), and write in many applications. CSV and Excel files with coordinate columns can be used directly in ArcGIS Pro, possibly after coordinate notation conversion or geocoding, and the product help provides good guidance on standard file usage. However, in an ETL context, where you want to retrieve and then prepare data for use, there are other common problems to overcome:

  • Files may be stored on the web, or generated from an API request URL
    • In these cases, the file path may not validate in ArcGIS Pro and the file should be downloaded for use
  • Files have their schema (column names and types) inferred from the data
    • You may want to override the default schema with specific field data type definitions
  • Coordinate columns may initially be stored in an unsupported notation
  • Data values may be encoded, for example storing null values as zeros

This sample shows how to handle these issues with minimal effort using core tools.
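
As a rough illustration of these steps with core tools (not the sample itself; the URL, file paths, and field names are hypothetical, and the output geodatabase is assumed to exist):

```python
# Download a CSV generated by a web API, then load it as points using core
# geoprocessing tools. Field types inferred from the CSV can be overridden by
# placing a schema.ini file next to it before running the tool.
import urllib.request
import arcpy

csv_url = "https://example.com/api/export?format=csv"   # hypothetical endpoint
local_csv = r"C:\Temp\incidents.csv"
urllib.request.urlretrieve(csv_url, local_csv)           # the URL itself may not validate in Pro

arcpy.env.overwriteOutput = True
arcpy.management.XYTableToPoint(
    in_table=local_csv,
    out_feature_class=r"C:\Temp\etl.gdb\incidents",      # existing file geodatabase
    x_field="longitude",                                 # hypothetical coordinate columns
    y_field="latitude",
    coordinate_system=arcpy.SpatialReference(4326),      # WGS 84
)
```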

Cloud native or cloud hosted well-known files

This pattern relies on ArcGIS Pro along with an extended Python Notebook environment.

Cloud object stores such as Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, and MinIO offer a cost-effective and performant file sharing mechanism, with support for secured or public access. Object storage is distinct from a ‘remote filesystem’ in the cloud, which for ETL purposes can be treated as local data. Currently, cloud store connections in ArcGIS Pro support only raster data, so the catalog experience in ArcGIS Pro, including folder connections, does not display cloud-hosted vector data such as Apache Parquet or GeoParquet, JSON, GeoJSON, CSV, or Excel, all common ETL source formats. This limitation can be worked around by extending the Pro Python environment with a library such as DuckDB that provides virtual filesystem support for cloud stores and treats remote files as relational database objects. Data publishers may work at any scale, as may you, so cloud-hosted well-known file ETL needs to support:

  • Importing single files
  • Importing any number of files simultaneously
  • Importing spatial or logical views of any number of files

This sample shows how to handle these requirements with the same flexible tooling.
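
As a rough illustration (not the sample itself), the sketch below shows the many-files case, assuming DuckDB with its httpfs extension; the bucket, prefix, and column names are hypothetical.

```python
# Treat any number of cloud-hosted Parquet files as one logical table by
# creating a view over a wildcard path. Bucket, prefix, and column names are
# hypothetical; credentials or region settings may be needed for secured stores.
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")   # virtual filesystem support for s3:// and https://
con.execute("LOAD httpfs;")

# A view over every file matching the wildcard behaves like a single table
con.execute("""
    CREATE VIEW parcels AS
    SELECT * FROM read_parquet('s3://my-bucket/parcels/*.parquet')
""")

# Logical filters are applied across the whole set of files
df = con.execute("""
    SELECT parcel_id, land_use, lon, lat
    FROM parcels
    WHERE land_use = 'residential'
""").fetch_df()
```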

Maintaining feature layers with ArcGIS Data Pipelines

This pattern uses Data Pipelines to maintain existing feature layers created through other means.

ArcGIS Data Pipelines can stand on its own as an ETL solution, but there are two cases where it may be considered as a maintenance tool for hosted feature layers created by other means:

  • Legacy hosted feature layers are already in production
  • Complexities like multilayer services or relationship classes exist
    • Data Pipelines cannot create services with these properties

Existing workflows may be adequate, or even necessary, for creating your layers, but they may not offer Data Pipelines’ advantages of ease of use, an organizational sharing model, SaaS hosting, and scheduling. In this situation, you can use Data Pipelines as a maintenance tool.

This sample shows a simple example of hosted layer maintenance.

ArcGIS Online notebooks and cloud-native GeoParquet

This pattern uses a hosted Python Notebook in ArcGIS Online with a Python runtime environment extended to access a cloud object store containing data in GeoParquet format.

Extending the runtime environments for ArcGIS Pro and ArcGIS Enterprise Notebook Server is a common pattern, but a less well-known approach can accomplish the same result in an ArcGIS Online hosted notebook. The standard Python runtimes include the conda and pip package managers, with pip generally being the simpler option. If you need connectivity to an external system that offers a Python interface, the required package is likely available on PyPI and simple to install.
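
For example, in a hosted notebook cell (duckdb is used here purely as an illustration; a package installed this way is available for the current notebook session):

```python
# Extend the hosted notebook runtime with pip, then import the new package.
!pip install duckdb
import duckdb
print(duckdb.__version__)
```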

This sample uses pip to extend the hosted notebook runtime and DuckDB to access data from the Overture Maps Foundation’s Places theme, an example of accessing an Amazon S3 object store with a hive-partitioned query, meaning that any number of GeoParquet files are read with a single wildcard query. Once queried, the data is used to refresh a hosted feature layer of places of interest in London, England (the same approach works for any other location worldwide).
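
A condensed sketch of this kind of refresh (not the sample itself) follows; the Overture release path and schema are abbreviated and may change between releases, and the hosted layer item ID and its field names are placeholders.

```python
# Query Overture Places for a London extent, then truncate and reload a hosted
# feature layer. Assumes anonymous access to the public bucket; very large
# loads would need to be applied in chunks.
import duckdb
from arcgis.gis import GIS
from arcgis.features import FeatureLayer

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region='us-west-2';")

places = con.execute("""
    SELECT id,
           names['primary'] AS name,
           bbox.xmin AS lon,   -- for point features the bbox collapses to the point
           bbox.ymin AS lat
    FROM read_parquet('s3://overturemaps-us-west-2/release/<release>/theme=places/type=place/*',
                      hive_partitioning=1)
    WHERE bbox.xmin BETWEEN -0.51 AND 0.33     -- rough London extent
      AND bbox.ymin BETWEEN 51.28 AND 51.69
""").fetch_df()

gis = GIS("home")                               # the notebook's own organization
layer = FeatureLayer.fromitem(gis.content.get("<hosted_layer_item_id>"))

layer.manager.truncate()                        # remove the previous load
adds = [
    {"attributes": {"place_id": r.id, "name": r.name},
     "geometry": {"x": r.lon, "y": r.lat, "spatialReference": {"wkid": 4326}}}
    for r in places.itertuples(index=False)
]
layer.edit_features(adds=adds)
```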

ArcGIS Online notebooks and a cloud data warehouse

This pattern uses an ArcGIS Online notebook with a runtime environment extended to support reading from and writing to Snowflake. Any cloud warehouse may be accessed similarly.

This pattern is similar to ArcGIS Online notebooks and cloud-native GeoParquet in that it uses an extended runtime container, but it adds the processing power of a cloud warehouse. ArcGIS Pro and ArcGIS Enterprise support connectivity to Snowflake and other cloud data warehouses to create query layers, which can be used as input to geoprocessing tools or visualized in ArcGIS Pro maps or web maps, but this pattern brings the power of SQL to your ArcGIS Online notebook processing. A cloud warehouse is not only a database but also a data sharing ecosystem: other organizations can share data of any scale into the warehouse, and you can combine that data to generate new insights and results.

This sample maintains a hosted feature layer of places of interest in San Francisco, California (or anywhere else worldwide you define). The data is sourced from Esri business partner SafeGraph’s Places dataset.

Metadata automation

This pattern uses ArcGIS Pro and the ArcGIS Data Interoperability extension to automate the creation or import of metadata during ETL processing.

Documenting the content and processing steps of items in ArcGIS is key to usability and trust, but it is easily overlooked, which makes automating metadata generation very valuable. In ETL you are often working with external data, which is frequently disconnected from the ArcGIS metadata flow.

This sample shows how to build metadata creation or harvesting into ETL workflows that maintain a hosted feature service; the approach is equally valid for other outputs, such as geodatabases.
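
The sample builds this into a Data Interoperability workbench; as a rough Python illustration of the same idea, arcpy’s metadata module can write documentation to an ETL output during a run (the feature class path and descriptive text below are hypothetical).

```python
# Write basic metadata to an ETL output with arcpy's metadata module.
# The feature class path and the descriptive text are hypothetical.
import arcpy
from arcpy import metadata as md

fc = r"C:\Temp\etl.gdb\incidents"

item_md = md.Metadata(fc)
item_md.title = "Incidents (nightly ETL load)"
item_md.summary = "Features refreshed nightly from an upstream incident API."
item_md.description = "Source: upstream API (hypothetical). Loaded by a scheduled ETL."
item_md.tags = "ETL, incidents, nightly"
item_md.save()   # writes the metadata back to the feature class
```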

OLE DB and ODBC Connections

This pattern uses ArcGIS Pro and OLE DB connections to connect to an external ODBC-compatible data source.

While ArcGIS natively supports many popular database platforms, additional connections to relational data, local or remote, can be made in ArcGIS Pro using industry-standard OLE DB data sources. These extend the database connection options in ArcGIS Pro, making additional data sources accessible for mapping, data engineering, and analytic workflows, including ETL. Connections may be shared across your network. If you need to work with data in an external database, OLE DB connections are a simple and usually available option. Beyond any built-in support in ArcGIS, some cloud data lakes and warehouses also offer ODBC drivers to simplify connectivity.

This sample shows how ArcGIS tools can use an OLE DB connection to enrich otherwise unsupported tabular data, apply a desired schema, and persist the result as a geodatabase feature class.
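
As a rough sketch of the core step (not the sample itself; the connection file and table names are hypothetical, and it assumes the OLE DB table validates as geoprocessing input in your environment):

```python
# Copy a table reached through an OLE DB connection into a file geodatabase.
# The .odc connection file is created in ArcGIS Pro; names are hypothetical.
import arcpy

oledb_table = r"C:\Connections\Finance.odc\dbo.ServiceRequests"
output_table = r"C:\Temp\etl.gdb\service_requests"

arcpy.conversion.ExportTable(oledb_table, output_table)
```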

Accessing remote APIs

This pattern uses ArcGIS Pro and the ArcGIS Data Interoperability extension to interact with an API that returns large responses through pagination, in chunks of JSON that then need to be parsed.

HTTP-based REST APIs often return a JSON response, popular for its combination of readability (by people or machines) and flexibility, but JSON objects are loosely typed and often unstructured and unordered. For responsiveness, APIs may return a logical dataset in pieces (“pages”), generated sequentially, that must be aggregated at the client. These two challenges, pagination and parsing into a defined schema, are easily solved with a low-code strategy.
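
For illustration only (the sample itself uses a low-code Data Interoperability workspace), the pagination half of the problem looks like this in Python against a hypothetical API that returns a "next" URL with each page:

```python
# Aggregate a paged JSON API response on the client. The endpoint and the
# "results"/"next" keys are hypothetical.
import requests

url = "https://api.example.com/records?page_size=1000"
records = []
while url:
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    payload = response.json()
    records.extend(payload["results"])   # accumulate this page
    url = payload.get("next")            # None when the last page is reached

print(len(records), "records fetched")
```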

This sample shows how to interact with a paging API to synchronize data daily with an ArcGIS Online hosted feature layer.

Scheduling ETLs as web tools

This pattern uses ArcGIS Pro, ArcGIS Enterprise web tools, and optionally ArcGIS Data Interoperability.

Perhaps the most prevalent organizational approach for any work, computerized or not, is repetition on a schedule. Beginning with ArcGIS Pro 3.4 and ArcGIS Enterprise 11.4, web tools (geoprocessing services) authored in ArcGIS Pro may also be scheduled in ArcGIS Pro. Geoprocessing tools that do ETL work may be authored as Python script tools, ModelBuilder models, or Data Interoperability spatial ETL tools, shared as web tools, and then scheduled from the tool’s Run menu in ArcGIS Pro. Key to making this pattern work is ensuring your input and output dataset parameters are registered with or accessible to ArcGIS Enterprise, not uploaded as static copies at publication time.
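
If the ETL logic is authored as a Python script tool, a skeleton might look like the following (parameter names and the ETL body are placeholders); note that the parameters are passed as text and URLs that remain accessible to ArcGIS Enterprise rather than datasets uploaded at publication time:

```python
# Skeleton of a Python script tool suitable for sharing as a web tool and
# scheduling. Parameter names and the ETL logic are placeholders.
import arcpy

def main():
    source_url = arcpy.GetParameterAsText(0)        # data source accessible to ArcGIS Enterprise
    target_layer_url = arcpy.GetParameterAsText(1)  # hosted feature layer to maintain
    arcpy.AddMessage(f"Syncing {source_url} -> {target_layer_url}")
    # ... connect, transform, and load steps go here ...

if __name__ == "__main__":
    main()
```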

This sample provides an example of a spatial ETL web tool scheduled to run twice daily to mirror an RSS feed to an ArcGIS Online hosted feature layer.
