Comparing integration tools – Options for Data Integration

One of the greatest benefits of cloud services such as Azure is that you can create the resources you need without a large upfront capital investment. The available tools cover the end-to-end process and can be scaled in and out as needed.

One of the first decisions to make is where to store raw data initially. Unless you are doing streaming analytics, where data is ingested continually from a source such as an IoT device (for example, a temperature sensor), you need a place to store your data files and retrieve them from.

Azure storage accounts provide storage capabilities in the form of file storage or Blob storage; however, a storage account configured as Azure Data Lake Storage Gen2 (ADLS Gen2) might be better suited to data analytics.

ADLS Gen2

ADLS Gen2 is an optional configuration feature of a standard storage account. The key difference is that it supports a filesystem-style hierarchical namespace. Files and objects are stored in real folder structures, which makes operations on whole directories faster.

You can still provision ADLS Gen1; however, this is not recommended. Gen2 includes additional features over Gen1. The following table shows the key differences between the two:

It is also worth noting that you cannot directly upgrade a Gen1 configuration to a Gen2 configuration. If you wish to migrate, you must create a new Gen2 account and manually copy or move the data.

Quite often, large data operations involve enumerating and modifying a large number of files at the same time, or renaming a directory that contains those files. Standard Blob storage supports directories, but they are virtual: a directory is only a prefix in each blob's name. An operation on a directory (such as renaming it) therefore has to update every file under it.

With ADLS Gen2, the hierarchical namespace makes directories real objects, so an operation such as a rename is applied once to the directory rather than to each file in it. ADLS Gen2 is also compatible with the Hadoop Distributed File System (HDFS) through a driver called the Azure Blob File System (ABFS) driver. This means that ADLS Gen2 can be used by Hadoop tools, and Hadoop is one of the most popular big data analytics platforms.
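The difference between a virtual directory and a real one can be sketched with a toy model. The code below does not talk to Azure; the function names, paths, and file counts are hypothetical and exist only to show why a rename in a flat namespace touches every file, while a rename in a hierarchical namespace touches a single directory entry.

```python
# Toy model only: nothing here calls Azure. All names are hypothetical.

def rename_flat(blobs: dict, old_prefix: str, new_prefix: str) -> int:
    """Rename a virtual directory in a flat namespace.

    A "directory" is just a prefix in each blob name, so every matching
    blob has to be re-keyed. Returns the number of blobs touched.
    """
    touched = 0
    for name in list(blobs):  # copy the keys so the dict can be modified
        if name.startswith(old_prefix):
            blobs[new_prefix + name[len(old_prefix):]] = blobs.pop(name)
            touched += 1
    return touched


def rename_hierarchical(root: dict, parent_path: list, old: str, new: str) -> int:
    """Rename a real directory in a hierarchical namespace.

    Only the entry in the parent directory changes, however many files
    the directory contains. Returns the number of entries touched.
    """
    parent = root
    for part in parent_path:
        parent = parent[part]
    parent[new] = parent.pop(old)
    return 1


# Flat namespace: 1,000 blobs that share the prefix "raw/2023/".
flat = {f"raw/2023/file{i}.csv": b"" for i in range(1000)}
flat_touched = rename_flat(flat, "raw/2023/", "raw/archive/")

# Hierarchical namespace: the same files inside nested directories.
tree = {"raw": {"2023": {f"file{i}.csv": b"" for i in range(1000)}}}
tree_touched = rename_hierarchical(tree, ["raw"], "2023", "archive")

print(flat_touched, tree_touched)
```

In this model the flat rename does work proportional to the number of files, while the hierarchical rename does a constant amount of work.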

ADLS Gen2 also supports Portable Operating System Interface (POSIX) permissions, and it combines low-cost storage with high performance. Finally, ADLS Gen2 replaces the original ADLS offering, which, although still available, is no longer recommended.

Note

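POSIX is a family of standards that defines a portable operating system interface, so that software written for one POSIX-compliant system can work on another. For ADLS Gen2, the relevant part is the POSIX permission model: read, write, and execute permissions on each file and directory for the owning user, the owning group, and everyone else.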
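As a rough illustration of the POSIX permission model, the following sketch converts an octal mode into the familiar `rwx` notation. It is plain Python and does not call any Azure API; the function name `mode_to_string` is hypothetical.

```python
def mode_to_string(mode: int) -> str:
    """Render a POSIX permission mode such as 0o750 as 'rwxr-x---'.

    The three groups of bits are, in order: owning user, owning group,
    and all other users. Each group holds read (4), write (2), and
    execute (1) flags.
    """
    result = []
    for shift in (6, 3, 0):  # user, group, other
        bits = (mode >> shift) & 0b111
        for flag, value in (("r", 4), ("w", 2), ("x", 1)):
            result.append(flag if bits & value else "-")
    return "".join(result)


# Owner has full access, group can read and traverse, others have none.
print(mode_to_string(0o750))  # rwxr-x---
```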

Because ADLS Gen2 is a configurable item of a standard storage account, you create it by creating a general-purpose Azure storage account and enabling Hierarchical namespace in the Advanced settings, as per the following screenshot:

Figure 13.2 – Configuring an Azure storage account as ADLS Gen2
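The same setting can also be expressed outside the portal. As a minimal sketch, the following Python builds the kind of request body an Azure Resource Manager deployment would use for a storage account, with the `isHnsEnabled` property corresponding to the Hierarchical namespace checkbox. The helper name, the region, and the `Standard_LRS` SKU are assumptions chosen for illustration, and the code only prints the JSON; it does not create anything in Azure.

```python
import json


def adls_gen2_account_body(location: str) -> dict:
    """Build an illustrative storage account definition with HNS enabled.

    Hypothetical helper: the values below are example choices, not
    recommendations from the text.
    """
    return {
        "location": location,
        "kind": "StorageV2",               # general-purpose v2 account
        "sku": {"name": "Standard_LRS"},   # assumed redundancy option
        "properties": {
            "isHnsEnabled": True,          # Hierarchical namespace = ADLS Gen2
        },
    }


body = adls_gen2_account_body("westeurope")
print(json.dumps(body, indent=2))
```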

When a storage account is created as an ADLS Gen2 storage account, the https://<account_name>.dfs.core.windows.net application programming interface (API) endpoint becomes available to interact with it. You can also use the ABFS driver that supports Hadoop, using the following Uniform Resource Identifier (URI) pattern:

abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/<path>/<file_name>

When you use the ABFS driver, the preceding URI is translated behind the scenes into calls to the REpresentational State Transfer (REST) API endpoints.
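To make the URI pattern concrete, here is a small sketch that builds an ABFS URI and derives the REST-style URL that addresses the same path on the `dfs.core.windows.net` endpoint. The function names and the sample file system, account, and path are hypothetical, and the mapping is a simplification of what the driver does, not its actual implementation.

```python
from urllib.parse import urlsplit


def abfs_uri(file_system: str, account_name: str, path: str, secure: bool = True) -> str:
    """Build an ABFS URI; 'abfss' is the TLS-secured variant of 'abfs'."""
    scheme = "abfss" if secure else "abfs"
    host = f"{account_name}.dfs.core.windows.net"
    return f"{scheme}://{file_system}@{host}/{path.lstrip('/')}"


def to_rest_url(uri: str) -> str:
    """Map an ABFS URI to the URL of the same path on the DFS endpoint."""
    parts = urlsplit(uri)
    if parts.scheme not in ("abfs", "abfss"):
        raise ValueError(f"not an ABFS URI: {uri}")
    web_scheme = "https" if parts.scheme == "abfss" else "http"
    # The file system sits before '@' in the URI and becomes the
    # first path segment of the REST URL.
    return f"{web_scheme}://{parts.hostname}/{parts.username}{parts.path}"


# Hypothetical file system, account, and path.
uri = abfs_uri("raw", "myaccount", "sales/2023/orders.csv")
print(uri)
print(to_rest_url(uri))
```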

Important note

Enable Hierarchical namespace when you create the storage account. Once the setting is enabled on an account, it cannot be turned off.

Although a standard storage account can be used, an ADLS Gen2 account provides the best capabilities for big data and analytics workloads.

ADLS Gen2 therefore provides you with a storage location to ingest data from, and it can also be used to store processed data. Services such as Power BI can be connected directly to storage accounts to create visualization dashboards.

Most other Azure tools, such as Azure Synapse Analytics and Azure Databricks, can read from and write to ADLS Gen2 storage accounts.

Before we look at some of these services, we will examine another tool that makes orchestrating the end-to-end ingestion and transformation of data easier.
