Azure-hosted IR – Options for Data Integration

An Azure-hosted IR is fully managed by Azure; these IRs can run Data flow, Data movement, and Dispatch activities on both public and private networks. Dispatch activities are those that launch and monitor work on external compute services, such as Azure Databricks.

An Azure-hosted IR creates virtual machines (VMs) in the background to run your pipelines on, and you can configure the type of compute it uses—general-purpose, compute-optimized, or memory-optimized. You can also configure the number of cores available to that compute, from 4 up to 256. Compute resources are automatically shut down when idle.
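Integration runtimes are defined in Data Factory's JSON format. As a sketch, an Azure-hosted IR with memory-optimized Data flow compute might look like the following (the name and sizing values are illustrative; timeToLive is the number of minutes the compute stays warm before being shut down):

```json
{
    "name": "AzureManagedIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve",
                "dataFlowProperties": {
                    "computeType": "MemoryOptimized",
                    "coreCount": 16,
                    "timeToLive": 10
                }
            }
        }
    }
}
```

Setting location to AutoResolve lets Data Factory pick the region closest to the data sink, rather than pinning the IR to a fixed region.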

When creating an Azure-hosted IR, you have the option of using virtual networks (VNets) or not. When VNet configuration is disabled, you can only connect to services on public endpoints; with it enabled, you can connect to your Azure-hosted services over private endpoints. This negates the need to configure network firewall rules on services such as Azure SQL Database or Azure Storage accounts. Without VNet configuration enabled, if your storage and Structured Query Language (SQL) accounts block public access, you would need to manually add the Data Factory Internet Protocol (IP) ranges for the region they run in.
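Enabling the VNet option places the IR inside Data Factory's managed virtual network, which in JSON is expressed as a reference on the IR definition. A minimal sketch, assuming the factory's default managed virtual network:

```json
{
    "name": "AzureManagedVnetIR",
    "properties": {
        "type": "Managed",
        "managedVirtualNetwork": {
            "type": "ManagedVirtualNetworkReference",
            "referenceName": "default"
        },
        "typeProperties": {
            "computeProperties": {
                "location": "AutoResolve"
            }
        }
    }
}
```

With this in place, managed private endpoints can be created from the factory to individual services, so traffic never traverses the public internet.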

Self-hosted IRs

Self-hosted IRs give you the option of creating a VM yourself and installing the IR software on it. This provides more control and even allows the IR to be hosted on-premises. However, you must maintain the VM yourself, including starting and stopping it, and you lose the ability to dynamically provision large amounts of compute for individual jobs.
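The Data Factory side of a self-hosted IR is a simple logical definition; a minimal sketch (the name and description are illustrative) might be:

```json
{
    "name": "OnPremSelfHostedIR",
    "properties": {
        "type": "SelfHosted",
        "description": "Runs on a VM that we maintain ourselves"
    }
}
```

After creating this, you install the self-hosted IR software on your VM and register it against the definition using an authentication key generated by Data Factory.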

Self-hosted IRs do not support Data flow activities, only Data movement and Dispatch activities.

Azure SQL Server Integration Services (SSIS)

Azure SSIS hosting is specifically designed for running SQL Server Integration Services (SSIS) packages. It is an ideal choice for scenarios where significant investment has already been made in developing SSIS packages, allowing you to move them to the cloud more efficiently.

Note

SSIS is an optional add-on to Microsoft SQL Server and provides the ability to create extract, transform, and load (ETL) packages for automating data movement and transformation. In many ways, Azure Data Factory can replace such packages; however, businesses often invest a great deal of time and money in developing them, which is why Azure provides the ability to reuse them in Data Factory.
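An Azure-SSIS IR is again defined as a managed IR, with additional SSIS-specific properties such as the SSIS catalog database. A hedged sketch—the node size, node count, region, and server endpoint are all placeholders you would replace with your own values:

```json
{
    "name": "AzureSsisIR",
    "properties": {
        "type": "Managed",
        "typeProperties": {
            "computeProperties": {
                "location": "WestEurope",
                "nodeSize": "Standard_D4_v3",
                "numberOfNodes": 2,
                "maxParallelExecutionsPerNode": 4
            },
            "ssisProperties": {
                "catalogInfo": {
                    "catalogServerEndpoint": "<your-sql-server>.database.windows.net",
                    "catalogAdminUserName": "<admin-user>",
                    "catalogPricingTier": "Basic"
                }
            }
        }
    }
}
```

The catalogInfo block tells the IR where to host the SSISDB catalog that stores and manages your deployed packages.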

When connecting Azure Data Factory to your services, you must consider how those services will authenticate. For example, when connecting to a storage account that is locked down using role-based access control (RBAC), you need a way to present a valid identity.

For scenarios such as this, you can use native connectivity options—for example, shared access signature (SAS) tokens with storage accounts or SQL authentication for Azure SQL Database—or you can use service principals or managed service identities, which we discussed in more detail in Chapter 6, Building Application Security.
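The authentication choice surfaces in the linked service definition. As a sketch, an Azure Blob Storage linked service that specifies a serviceEndpoint (rather than a connectionString) causes Data Factory to authenticate with its managed identity, which must then be granted an appropriate RBAC role—such as Storage Blob Data Reader—on the account (the account name here is a placeholder):

```json
{
    "name": "StorageViaManagedIdentity",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "serviceEndpoint": "https://<account>.blob.core.windows.net"
        }
    }
}
```

By contrast, a SAS-based linked service would supply a sasUri in typeProperties, trading RBAC management for token management.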

An example of building a simple data copy pipeline using managed identities can be found on my blog at https://bretthargreaves.com/2020/11/10/move-data-securely-with-azure-data-factory.

Out of the box, Azure Data Factory provides a wide range of tools and activities. However, there are times when advanced tooling is required—specifically, data analytics services. Two analytical tools that can be integrated into Azure Data Factory are Azure Databricks and Azure Synapse Analytics, which we will explore next.

