Most organizations store important data across multiple systems. Data loading and integration tools, such as pipelines and extract, transform, and load (ETL) tools, are critical to moving data between these systems so it can be combined with and analyzed against other sources, or used to keep other relevant data up to date.
Many different technical solutions can be used for moving data, from manual solutions where data is copied on demand to automated systems that perform complicated data transformations.
Use of the term “integration” in relation to data pipelines or ETLs is complicated, as integration usually implies closer, more real-time access to data. ETLs are an appropriate tool for moving data between systems when changes happen infrequently, or when access to the data does not rely on having the absolute latest copy, such as when a nightly sync is sufficient. At this level, they fulfill many integration requirements even though the connection is asynchronous in nature.
ArcGIS Data Pipelines is an ArcGIS Online capability that provides built-in integration support for connecting your data with ArcGIS. With Data Pipelines, you can connect to and read data from where it is stored, perform data preparation operations, and write the data out to a hosted feature layer that is readily available in ArcGIS. You can use the no-code Data Pipelines interface to construct, run, schedule and reproduce your data preparation workflows.
Data Pipelines works with vector data (for example, points, lines, and polygons) and tabular data (for example, data represented as a table). You can connect to a variety of data sources including Amazon S3, Google BigQuery, Snowflake, feature layers, uploaded local files, and more. Once connected, you can use tools to blend, build, and integrate datasets for use in your workflows.
The tools available in Data Pipelines are categorized into tool sets with capabilities such as clean, construct, integrate, and format. For example, the following workflows are supported by Data Pipelines tools:
While building a data pipeline and configuring tools, you can preview results at each processing step as a map or table, optionally sorting table columns to inspect values such as minimums, maximums, and nulls. Iterative inspection and application of data preparation tools can help improve data quality and assurance throughout the process.
Once you’ve completed configuration of the data pipeline, you can run it to create or update an ArcGIS feature layer that will be available in your content. You can configure geometry and time properties for the output feature layer so it’s ready for use in additional workflows such as spatial or temporal analysis, dashboards, or web maps.
Data Pipelines can be a suitable choice when:
For more information on Data Pipelines, see Introduction to Data Pipelines.
The ArcGIS Data Interoperability extension is a spatial extract, transform, and load (ETL) tool set for both ArcGIS Pro and ArcGIS Enterprise that runs either through standalone tools or in the geoprocessing framework. Data Interoperability is built from and is compatible with Safe Software’s FME product.
Data Interoperability provides users with a no-code authoring experience using the Workbench desktop app, with work saved to local workbench files, or as Spatial ETL tools in a project toolbox. Workbenches can connect to a wide variety of data sources, perform simple or complex operations on data in transit, and then write data to a wide variety of destinations. This is a fully featured ETL system, where workbenches can be run manually for inspection or testing, automated to run on a regular basis, or invoked as part of a geoprocessing service.
Data Interoperability provides three desktop apps (Workbench, Data Inspector and Quick Translator) that can be accessed from ArcGIS Pro. ETL tool results can also be shared to ArcGIS Server as web tools like any other geoprocessing result. This lets you publish ETLs which can be accessed as web services, for example in a Web AppBuilder geoprocessing widget.
Data Interoperability may be a suitable option when:
For more information on Data Interoperability, see the following resources:
An additional common approach to building ETLs is to use Python, usually in combination with arcpy and the arcgis module of the ArcGIS API for Python. The Python developer community, along with many commercial providers, has created hundreds of libraries that assist with connecting to almost any imaginable data source, from database clients to storage types, file parsers, and web client libraries. This means that any data source you can connect to can be pulled into Python, transformed or reshaped, and pushed to an output, which in an ArcGIS-based system is often an editable feature layer, image file, or 3D dataset.
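For illustration, a minimal sketch of this pattern might look like the following, assuming a CSV endpoint with coordinate columns and an existing hosted feature layer; the URL, field names, credentials, and item ID are placeholders rather than part of any specific workflow.

```python
# Minimal sketch: extract rows from a CSV endpoint, transform them into point
# features, and load them into an existing hosted feature layer.
# The URL, fields, credentials, and item ID are placeholders.
import csv
import io

import requests
from arcgis.gis import GIS
from arcgis.features import Feature

# Extract: pull a CSV from a web endpoint
response = requests.get("https://example.com/data/sensors.csv", timeout=60)
rows = list(csv.DictReader(io.StringIO(response.text)))

# Transform: reshape each row into a point feature
features = [
    Feature(
        geometry={"x": float(r["lon"]), "y": float(r["lat"]),
                  "spatialReference": {"wkid": 4326}},
        attributes={"site_id": r["site_id"], "reading": float(r["reading"])},
    )
    for r in rows
]

# Load: replace the contents of an existing hosted feature layer
gis = GIS("https://www.arcgis.com", "<username>", "<password>")
layer = gis.content.get("<item-id>").layers[0]
layer.manager.truncate()            # truncate is available on hosted layers without sync
layer.edit_features(adds=features)
```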
A popular pattern is using a Notebook that takes advantage of industry standard data access protocols. For example, when integrating with a cloud warehouse, install the warehouse’s PEP 249-compliant database access package so the Notebook can use standard SQL or cursor-based processing.
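As a hedged sketch of that idea, the cursor pattern looks the same regardless of vendor; here `warehouse_dbapi` is a hypothetical stand-in for whichever PEP 249-compliant driver your warehouse provides, and the connection arguments and SQL parameter style vary by driver.

```python
# Sketch of the PEP 249 (DB-API 2.0) cursor pattern in a notebook.
# `warehouse_dbapi` is a hypothetical stand-in for your vendor's driver;
# connect() arguments and the SQL parameter style vary by driver.
import warehouse_dbapi  # hypothetical PEP 249-compliant driver

conn = warehouse_dbapi.connect(account="<account>", user="<user>", password="<password>")
try:
    cur = conn.cursor()
    # Standard SQL issued through the standard cursor interface
    cur.execute(
        "SELECT site_id, lon, lat, reading FROM telemetry WHERE reading_date = ?",
        ("2024-01-01",),
    )
    for site_id, lon, lat, reading in cur.fetchall():
        ...  # reshape each row into a feature, as in the earlier sketch
finally:
    conn.close()
```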
An emerging pattern is to write Python-based ETLs that take advantage of remote file access over HTTP using the S3 API that is now common among industry storage providers. This pattern applies to cloud object stores hosting well-known file types like Apache Parquet, GeoParquet, JSON, or CSV, and is enabled by extending the Python environment. This approach allows files to be read or written as if they were in a SQL-compliant database, with support for partitioning a single logical table across any number of physical files to take advantage of parallel processing across the set of files. Together, these two approaches allow you to perform ETL against cloud warehouses and cloud object stores with powerful performance and a complete feature set.
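A minimal sketch of this pattern, assuming DuckDB with its httpfs extension and a placeholder public bucket, shows how a wildcard treats many physical Parquet files as one logical, SQL-queryable table:

```python
# Sketch: DuckDB's httpfs extension reads Parquet directly from an S3-compatible
# store, and a wildcard treats many physical files as one logical table.
# The bucket, path, and columns are placeholders.
import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SET s3_region = 'us-west-2';")  # public bucket; add credentials for secured stores

df = con.execute(
    """
    SELECT id, name, longitude, latitude
    FROM read_parquet('s3://example-bucket/places/year=2024/*.parquet')
    WHERE country = 'GB'
    """
).df()
```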
A Python-based ETL may be suitable when:
Many other ETL offerings are available from cloud providers, independent software vendors, and through open-source tools. Each of these offerings will support different inputs, processors, and outputs, so it is important to consider the data movement requirements for your system before choosing an offering. Both ArcGIS Data Pipelines and ArcGIS Data Interoperability offer a no-code experience and include pre-built connections to write data into ArcGIS, which can be an important efficiency boost when creating a system that implements the no-code pattern.
This section provides several example strategies for data migration workflows, each using a different technology and including useful tips and guidelines. These samples should be used as a starting point for building your organizational data migration workflows, not used as-is or without review.
This pattern relies on ArcGIS Pro only.
One of the most common shared file formats is CSV (comma-separated values), also known as ASCII or text files, followed closely by Microsoft Excel workbooks. CSV and Excel files are popular for their ease of sharing, and because they can be read (including by people) and written in many applications. CSV and Excel files with coordinate columns may be used directly in Pro, possibly after coordinate notation conversion or geocoding. The help topics linked above provide good guidance on standard file usage. However, in an ETL context, where you want to reach and then prepare data for use, there are other common problems to overcome:
This sample shows how to handle these issues with minimal effort using core tools.
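As a rough sketch of the core-tools approach, the XY Table To Point geoprocessing tool can turn a CSV with coordinate columns into a point feature class; the paths and field names below are placeholders, and real files may first need the cleanup steps noted above.

```python
# Sketch of the core-tools approach: a CSV with coordinate columns becomes a
# point feature class. Paths and field names are placeholders; real data may
# first need encoding fixes, type casting, or coordinate notation conversion.
import arcpy

arcpy.env.overwriteOutput = True

csv_path = r"C:\data\incoming\facilities.csv"   # placeholder input
out_fc = r"C:\data\work\etl.gdb\facilities"     # placeholder output

# XY Table To Point builds point geometry from the longitude/latitude columns
arcpy.management.XYTableToPoint(
    in_table=csv_path,
    out_feature_class=out_fc,
    x_field="longitude",
    y_field="latitude",
    coordinate_system=arcpy.SpatialReference(4326),
)
```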
This pattern relies on ArcGIS Pro along with an extended Python Notebook environment.
Cloud object stores such as Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage, MinIO, or others offer a cost-effective and performant file sharing mechanism, with support for secured or public access. Object storage is different from a ‘remote filesystem’ in cloud storage, which for ETL purposes can be considered local data. Currently, cloud store connections in Pro support only raster data, so the catalog experience in ArcGIS Pro, including folder connections, does not display cloud-hosted vector data such as Apache Parquet or GeoParquet, JSON, GeoJSON, CSV, or Excel, all common ETL source formats. This limitation can be worked around by extending the Pro Python environment to provide virtual filesystem support for cloud stores and to treat remote files as relational database objects, for example with DuckDB. Data publishers may work at any scale, as may you, so ETL of cloud-hosted, well-known file formats needs to support:
This sample shows how to handle these requirements with the same flexible tooling.
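A hedged sketch of the round trip, assuming DuckDB has been installed into a cloned ArcGIS Pro Python environment and using placeholder bucket, column, and path names, might look like this:

```python
# Sketch of the round trip: DuckDB (installed into a cloned ArcGIS Pro Python
# environment) queries files in a cloud store, and the result is persisted as a
# geodatabase feature class. Bucket, columns, and paths are placeholders.
import duckdb
from arcgis.features import GeoAccessor  # registers the pandas .spatial accessor

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")

df = con.execute(
    "SELECT name, category, lon, lat "
    "FROM read_parquet('s3://example-bucket/poi/*.parquet')"
).df()

# Build a spatially enabled DataFrame from the coordinate columns and persist it
sdf = GeoAccessor.from_xy(df, x_column="lon", y_column="lat", sr=4326)
sdf.spatial.to_featureclass(location=r"C:\data\work\etl.gdb\poi")
```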
This pattern uses Data Pipelines to maintain existing feature layers created through other means.
ArcGIS Data Pipelines can stand on its own as an ETL solution, but there are two cases where it may be considered as a maintenance tool for hosted feature layers created by other means:
Existing workflows may be adequate or even necessary to create your layers but not offer Data Pipelines’ advantages of ease of use, organizational sharing model, SaaS hosting and scheduling capability. In this situation you can consider Data Pipelines as a maintenance tool.
This sample shows a simple example of hosted layer maintenance.
This pattern uses a hosted Python Notebook in ArcGIS Online with a Python runtime environment extended to access a cloud object store containing data in GeoParquet format.
Extending the runtime environments for ArcGIS Pro and ArcGIS Enterprise Notebook Server is a common pattern, but a less well-known pattern can accomplish the same result in an ArcGIS Online hosted notebook. The available Python libraries in the standard Python runtimes include the conda and pip package managers, with pip generally being the simpler option. If you find yourself in need of connectivity to an external system that offers a Python interface, it is likely to be available and easily installed from PyPI.
This sample uses pip to extend the local notebook runtime and DuckDB to access data from the Overture Maps Foundation’s Places theme, an example of accessing an Amazon S3 object store by hive partition query, meaning in this case that any number of GeoParquet files are accessed by a wildcard query. Once accessed and queried, the data is used to refresh a hosted feature layer of places of interest in London, England (and could support any other worldwide location).
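A condensed sketch of the notebook cells might look like the following; the Overture release label is a placeholder, and the column and partition names follow recent Overture releases, so verify them against the schema of the release you use.

```python
# Condensed notebook sketch: pip extends the hosted runtime in place, then
# DuckDB queries Overture's hive-partitioned GeoParquet files by wildcard.
# The release label is a placeholder; verify columns against your release.
!pip install duckdb

import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SET s3_region = 'us-west-2';")

# hive_partitioning=true exposes the theme/type folder names as queryable columns
places = con.execute(
    """
    SELECT id, bbox.xmin AS lon, bbox.ymin AS lat
    FROM read_parquet(
        's3://overturemaps-us-west-2/release/<release>/theme=places/type=place/*',
        hive_partitioning = true)
    WHERE bbox.xmin BETWEEN -0.51 AND 0.33
      AND bbox.ymin BETWEEN 51.28 AND 51.70   -- approximate London extent
    """
).df()

# The resulting DataFrame can then refresh the hosted feature layer, for example
# with GeoAccessor.from_xy followed by truncate and edit_features as shown earlier.
```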
This pattern uses an ArcGIS Online notebook with a runtime environment extended to support reading from or writing to Snowflake. Any cloud warehouse may be accessed similarly.
This pattern is similar to ArcGIS Online notebooks and cloud-native GeoParquet in that it uses an extended runtime container, but it adds the processing power of a cloud warehouse. ArcGIS Pro and Enterprise support connectivity to Snowflake and other cloud data warehouses to create query layers, which can be used as input to geoprocessing tools or visualized in ArcGIS Pro maps or web maps, but using this pattern brings the power of SQL to your Online notebook processing. A cloud warehouse is not only a database but also a data sharing ecosystem: other organizations can share data of any scale into the warehouse, and you can combine those data to generate insights and new results.
This sample maintains a hosted feature layer of places of interest in San Francisco, California (or anywhere else worldwide you define). The data is sourced from Esri business partner SafeGraph’s Places dataset.
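A hedged sketch of this sample, assuming the notebook runtime has been extended with the snowflake-connector-python package and using placeholder account, table, query, and item values, might look like this:

```python
# Sketch assuming the notebook runtime has been extended with
# `pip install "snowflake-connector-python[pandas]"`. Account, credentials,
# table, query, and item ID are placeholders.
import snowflake.connector
from arcgis.gis import GIS
from arcgis.features import GeoAccessor

conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    warehouse="<warehouse>", database="<database>", schema="<schema>",
)
cur = conn.cursor()
cur.execute(
    "SELECT place_name, category, longitude, latitude "
    "FROM places WHERE city = 'San Francisco'"   # placeholder table and filter
)
df = cur.fetch_pandas_all()   # column names come back uppercased by Snowflake
conn.close()

# Refresh the hosted feature layer: clear existing rows, then load the new ones
gis = GIS("home")
layer = gis.content.get("<item-id>").layers[0]
layer.manager.truncate()
sdf = GeoAccessor.from_xy(df, x_column="LONGITUDE", y_column="LATITUDE", sr=4326)
layer.edit_features(adds=sdf.spatial.to_featureset())
```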
This pattern uses ArcGIS Pro and the ArcGIS Data Interoperability extension to automate the creation or import of metadata during ETL processing.
Documenting the content and processing steps of items in ArcGIS is key to usability and trust, but it is an easily overlooked activity, which makes automating metadata generation very valuable. In ETL, you are often working with external data, which is commonly disconnected from ArcGIS metadata flows.
This sample shows how to build metadata creation or harvesting into ETL workflows that maintain a hosted feature service, but are equally valid with other outputs, such as geodatabases.
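The sample itself uses Data Interoperability, but the underlying idea can also be sketched with the arcpy.metadata module, creating or harvesting metadata as part of the ETL run; the paths and descriptive text below are placeholders.

```python
# Sketch with the arcpy.metadata module: document the ETL output as part of the
# run, or harvest metadata from a source document. Paths and text are placeholders.
from arcpy import metadata as md

out_fc = r"C:\data\work\etl.gdb\facilities"   # dataset produced by the ETL step

item_md = md.Metadata(out_fc)
item_md.title = "Facilities (nightly ETL)"
item_md.summary = "Maintained by a scheduled ETL; do not edit manually."
item_md.description = "Source: corporate facilities database, loaded nightly."
item_md.tags = "ETL, facilities, nightly"
item_md.save()

# Alternatively, harvest existing documentation from a source metadata file
# source_md = md.Metadata(r"C:\data\incoming\facilities_metadata.xml")
# item_md.copy(source_md)
# item_md.save()
```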
This pattern uses ArcGIS Pro and OLE DB connections to connect to an external ODBC-compatible data source.
While ArcGIS supports many popular database platforms, more connections may be made in ArcGIS Pro to relational data, local or remote, using industry standard OLE DB data sources. These extend the database connection options in ArcGIS Pro, making additional data sources accessible for mapping, data engineering and analytic workflows, including ETL. Connections may be shared across your network. If you need to work with data in an external database, OLE DB connections are a simple and usually available option. Beyond any built-in support in ArcGIS, some cloud data lakes and warehouses also offer ODBC drivers to simplify connectivity.
This sample shows how ArcGIS tools can use an OLE DB connection to enrich unsupported tabular data, applying a desired schema and persisting the result as a geodatabase feature class.
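The sample works through an OLE DB connection in ArcGIS Pro; as a script-based alternative sketch, the same enrichment can be expressed with an ODBC driver through pyodbc, with the DSN, SQL, schema, and output path below as placeholders.

```python
# Script-based alternative sketch using an ODBC driver via pyodbc rather than
# the OLE DB connection used in the sample. DSN, SQL, schema, and output path
# are placeholders.
import pandas as pd
import pyodbc
from arcgis.features import GeoAccessor

# Connect through an ODBC data source name configured on the machine
conn = pyodbc.connect("DSN=corporate_dw;UID=<user>;PWD=<password>")
df = pd.read_sql(
    "SELECT asset_id, asset_type, lon, lat FROM assets WHERE status = 'active'",
    conn,
)
conn.close()

# Apply the desired schema and persist the result as a geodatabase feature class
df = df.astype({"asset_id": "int64", "asset_type": "string"})
sdf = GeoAccessor.from_xy(df, x_column="lon", y_column="lat", sr=4326)
sdf.spatial.to_featureclass(location=r"C:\data\work\etl.gdb\assets")
```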
This pattern uses ArcGIS Pro and the ArcGIS Data Interoperability extension to interact with an API that returns large responses through pagination, in chunks of JSON which then need to be parsed.
HTTP-based REST APIs often return a JSON response, popular for its combination of readability (by people or machines) and flexibility, but JSON objects are loosely typed and often unstructured and unordered. For responsiveness reasons, APIs may return a logical dataset in pieces (“pages”), generated sequentially, that must be aggregated at the client. These two challenges, pagination and parsing into a defined schema, are easily solved with a low-code strategy.
This sample shows how to interact with a paging API to synchronize data daily with an ArcGIS Online hosted feature layer.
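The sample solves this with a low-code Data Interoperability workbench; purely for illustration, the same pagination logic can be expressed in Python, with a hypothetical endpoint and response shape.

```python
# For illustration only: the pagination logic expressed in Python. The endpoint,
# parameter names, and response shape are hypothetical.
import requests

def fetch_all_pages(url, page_size=100):
    """Request pages until the API returns an empty batch, accumulating records."""
    records, offset = [], 0
    while True:
        resp = requests.get(url, params={"limit": page_size, "offset": offset}, timeout=60)
        resp.raise_for_status()
        batch = resp.json().get("results", [])
        if not batch:
            break
        records.extend(batch)
        offset += page_size
    return records

# rows = fetch_all_pages("https://example.com/api/incidents")
# Each record is then parsed into the defined schema and written to the hosted layer.
```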
This pattern uses ArcGIS Pro, ArcGIS Enterprise web tools, and optionally ArcGIS Data Interoperability.
Perhaps the most prevalent organizational approach for any work, computerized or not, is repetition on a schedule. Beginning with ArcGIS Pro 3.4 and ArcGIS Enterprise 11.4, web tools (geoprocessing services) authored in ArcGIS Pro may also be scheduled in ArcGIS Pro. Geoprocessing tools that do ETL work may be authored as Python script tools, ModelBuilder models, or Data Interoperability spatial ETL tools, shared as web tools, and then scheduled from the tool’s Run menu in ArcGIS Pro. Key to making this pattern work is ensuring your input and output dataset parameters are registered with or accessible to ArcGIS Enterprise and not uploaded as static copies at publication time.
This sample provides an example of a spatial ETL web tool scheduled to run twice daily to mirror an RSS feed to an ArcGIS Online hosted feature layer.