Data pipelines and ETLs

Most organizations store important data across multiple systems. Data loading and integration tools, such as pipelines and extract, transform, and load (ETL) tools, are critical to moving data between these systems so it can be combined with and analyzed against other sources, or used to keep related data up to date.

Many different technical solutions can be used for moving data, from manual solutions where data is copied on demand to automated systems that perform complicated data transformations.

Use of the term “integration” in relation to data pipelines or ETLs is nuanced, as integrations usually imply closer, more real-time access to data. ETLs are an appropriate tool for moving data between systems when changes happen infrequently, or when access to the data does not depend on having the absolute latest copy, such as when a nightly sync is sufficient. At that level, they fulfill many integration requirements even though the connection is asynchronous in nature.

ArcGIS Data Pipelines

ArcGIS Data Pipelines is an ArcGIS Online capability that provides built-in integration support for connecting your data with ArcGIS. With Data Pipelines, you can connect to and read data from where it is stored, perform data preparation operations, and write the data out to a hosted feature layer that is readily available in ArcGIS. You can use the no-code Data Pipelines interface to construct, run, schedule, and reproduce your data preparation workflows.

Data Pipelines works with vector data (for example, points, lines, and polygons) and tabular data (for example, data represented as a table). You can connect to a variety of data sources including Amazon S3, Google BigQuery, Snowflake, feature layers, uploaded local files, and more. Once connected, you can use tools to blend, build, and integrate datasets for use in your workflows.

The tools available in Data Pipelines are organized into tool sets with capabilities such as clean, construct, integrate, and format. For example, the following workflows are supported by Data Pipelines tools:

  • Manipulate dataset schemas by updating or adding fields
  • Select a subset of fields to extract targeted information
  • Find and replace attribute values to clean or simplify the data
  • Combine datasets using join or merge functionality
  • Calculate fields using Arcade functions
  • Create geometry or time fields for use in spatial or temporal analysis
  • Un-nest JSON objects in a column
  • Map fields between schemas
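For intuition, the kind of un-nesting Data Pipelines performs on a JSON column can be sketched in plain Python. The column and field names below are hypothetical; Data Pipelines itself is no-code and requires no scripting:

```python
import json

# Hypothetical input rows: each has an "attributes" column holding a JSON object.
rows = [
    {"site_id": 1, "attributes": '{"owner": "City", "height_m": 12.5}'},
    {"site_id": 2, "attributes": '{"owner": "County", "height_m": 8.0}'},
]

def unnest_json_column(rows, column):
    """Flatten a JSON-object column into top-level fields on each row."""
    out = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != column}
        flat.update(json.loads(row[column]))  # promote nested keys to fields
        out.append(flat)
    return out

flattened = unnest_json_column(rows, "attributes")
# Each row now exposes "owner" and "height_m" as ordinary fields.
```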

While building a data pipeline and configuring tools, you can preview results at each processing step as a map or table, optionally sorting table columns to inspect values such as minimums, maximums, and nulls. Iterative inspection and application of data preparation tools can help improve data quality and assurance throughout the process.

Once you’ve completed configuration of the data pipeline, you can run it to create or update an ArcGIS feature layer that will be available in your content. You can configure geometry and time properties for the output feature layer so it’s ready for use in additional workflows such as spatial or temporal analysis, dashboards, or web maps.

Data Pipelines can be a suitable choice when:

  • A no-code to low-code solution is attractive.
  • You create or maintain hosted feature layers in ArcGIS Online.
  • The output services contain a single feature layer.
  • The output layers are not the origin of relationships.
  • A SaaS tool managed with your organization’s content is desired.
  • Data quality is discoverable by preview.
  • Data size is moderate (up to several million rows, depending on row width).

For more information on Data Pipelines, see Introduction to Data Pipelines.

ArcGIS Data Interoperability

The ArcGIS Data Interoperability extension is a spatial extract, transform, and load (ETL) tool set for both ArcGIS Pro and ArcGIS Enterprise that runs either through standalone tools or in the geoprocessing framework. Data Interoperability is built from and is compatible with Safe Software’s FME product.

Data Interoperability provides users with a no-code authoring experience using the Workbench desktop app, with work saved to local workbench files, or as Spatial ETL tools in a project toolbox. Workbenches can connect to a wide variety of data sources, perform simple or complex operations on data in transit, and then write data to a wide variety of destinations. This is a fully featured ETL system, where workbenches can be run manually for inspection or testing, automated to run on a regular basis, or invoked as part of a geoprocessing service.

Data Interoperability provides three desktop apps (Workbench, Data Inspector, and Quick Translator) that can be accessed from ArcGIS Pro. ETL tool results can also be shared to ArcGIS Server as web tools, like any other geoprocessing result. This lets you publish ETLs that can be accessed as web services, for example in a Web AppBuilder geoprocessing widget.

Data Interoperability may be a suitable option when:

  • A no-code solution is desired.
  • You need to read or write otherwise unsupported data types.
  • You need to output complex data models.
  • Geoprocessing integration is desired.
  • Deep inspection of data is needed to uncover quality issues.
  • Deep support for complex data types is needed (for example, Excel).

For more information on Data Interoperability, see the following resources:

Other ETL options

An additional common approach to building ETLs is to use Python, usually in combination with arcpy and the arcgis module of the ArcGIS API for Python. The Python developer community, along with many commercial providers, has created hundreds of libraries that assist with connecting to almost any imaginable data source, from database clients to storage types, file parsers, and web client libraries. This means that any data source you can connect to can be pulled into Python, transformed or reshaped, and pushed to an output, which in an ArcGIS-based system is often an editable feature layer, image file, or 3D dataset.
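The extract-transform-load shape of such a script can be sketched with only the standard library. The source CSV, field names, and SQLite destination below are illustrative stand-ins; a real ArcGIS workflow would typically push the transformed rows to a feature layer via the arcgis package instead:

```python
import csv, io, sqlite3

# Illustrative source data; in practice this might come from a web request
# or a database client rather than an in-memory string.
SOURCE_CSV = "id,name,lon,lat\n1,Depot A,-117.19,34.05\n2,Depot B,-116.97,33.92\n"

def extract(text):
    """Extract: parse raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(records):
    """Transform: cast coordinate fields and drop rows missing a name."""
    out = []
    for r in records:
        if not r["name"]:
            continue
        r["lon"], r["lat"] = float(r["lon"]), float(r["lat"])
        out.append(r)
    return out

def load(records, conn):
    """Load: write rows to a destination table (SQLite as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS depots (id TEXT, name TEXT, lon REAL, lat REAL)"
    )
    conn.executemany("INSERT INTO depots VALUES (:id, :name, :lon, :lat)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), conn)
```

Keeping extract, transform, and load as separate functions makes each stage testable on its own and lets you swap the source or destination without touching the transformation logic.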

A popular pattern is to use a notebook that takes advantage of industry-standard data access protocols. For example, when integrating with a cloud warehouse, install the warehouse’s PEP 249-compliant database access package so the notebook can use standard SQL or cursor-based processing.
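Because PEP 249 standardizes the connect/cursor/execute surface, the same code shape works against most warehouse drivers. Here the standard library’s sqlite3 module, itself a PEP 249 implementation, stands in for a cloud warehouse’s access package:

```python
import sqlite3  # stand-in for a warehouse's PEP 249-compliant driver

# The DB-API pattern: connect -> cursor -> execute -> fetch.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (station TEXT, value REAL)")
cur.executemany(
    "INSERT INTO readings VALUES (?, ?)",  # parameterized, never string-built
    [("A", 1.5), ("A", 2.5), ("B", 4.0)],
)
cur.execute(
    "SELECT station, AVG(value) FROM readings GROUP BY station ORDER BY station"
)
averages = cur.fetchall()  # standard cursor-based result retrieval
conn.close()
```

Note that the parameter placeholder style varies by driver (sqlite3 uses `?`; many warehouse drivers use `%s` or named styles), so check the package’s documented `paramstyle`.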

An emerging pattern is to write Python-based ETLs that take advantage of remote file access over HTTP using the S3 API, which is now common among industry storage providers. This pattern applies to cloud object stores hosting well-known file types such as Apache Parquet, GeoParquet, JSON, or CSV, and is enabled by extending the Python environment. The approach allows files to be read or written as if they were in a SQL-compliant database, with support for partitioning a single logical table across any number of physical files to take advantage of parallel processing across the set. Together, these two approaches let you perform ETL against cloud warehouses and cloud object stores with strong performance and a complete feature set.
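The partitioning idea, one logical table spread across many physical part files that can be read in parallel, can be sketched with the standard library alone. The in-memory part files below are stand-ins for objects in a cloud store; a production pipeline would fetch them over HTTP/S3 with a query engine or file-format library rather than hand-rolled code:

```python
import csv, io
from concurrent.futures import ThreadPoolExecutor

# Three physical "part files" of one logical table (in-memory stand-ins for
# objects in a cloud bucket; real pipelines would fetch these over HTTP/S3).
PARTS = {
    "part-0.csv": "id,value\n1,10\n2,20\n",
    "part-1.csv": "id,value\n3,30\n",
    "part-2.csv": "id,value\n4,40\n5,50\n",
}

def read_part(name):
    """Read one partition; each call is independent, so reads can run in parallel."""
    return list(csv.DictReader(io.StringIO(PARTS[name])))

# Fan out over the partitions, then union the results into one logical table.
with ThreadPoolExecutor() as pool:
    table = [row for part in pool.map(read_part, sorted(PARTS)) for row in part]

total = sum(int(row["value"]) for row in table)
```

Because each partition is independent, read throughput scales with the number of workers until network or storage bandwidth becomes the bottleneck.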

A Python-based ETL may be suitable when:

  • Your organization has the Python skills and bandwidth to maintain a codebase.
  • You are looking to maximize and carefully manage ETL performance.
  • You need a special, otherwise unsupported, capability or connectivity to a specific data source or provider with a publicly available Python module.

Many other ETL offerings are available from cloud providers, independent software vendors, and through open-source tools. Each of these offerings will support different inputs, processors, and outputs, so it is important to consider the data movement requirements for your system before choosing an offering. Both ArcGIS Data Pipelines and ArcGIS Data Interoperability offer a no-code experience and include pre-built connections to write data into ArcGIS, which can be an important efficiency boost when creating a system that implements the no-code pattern.