
Working with data factory components

Learn how to debug data factory pipelines and add parameters to data factory components with this article.

Let’s look at control flow in Azure Data Factory, how pipelines are put together, how to debug them, and how to add parameters to data factory components.

What is control flow?

Control flow is an orchestration of pipeline activities that includes chaining activities in a sequence, branching, defining parameters at the pipeline level, and passing arguments while invoking the pipeline on demand or from a trigger.

Control flow can also include looping containers, which can pass information to each iteration of the loop.

When a ForEach loop is used as a control flow activity, Azure Data Factory can start multiple activities in parallel. This allows you to build complex, iterative processing logic within your pipelines, which supports diverse data integration patterns such as building a modern data warehouse.

Some of the common control flow activities are described in the sections below.

Chaining activities

Within Azure Data Factory, you can chain activities in a sequence within a pipeline. It is possible to use the dependsOn property in an activity definition to chain it with an upstream activity.
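As a rough illustration, the sketch below builds the JSON definitions of two chained activities as Python dictionaries. The activity and pipeline names (CopyRawLogs, CleanLogs, ChainedPipeline) are hypothetical, and the typeProperties of each activity are deliberately omitted; only the shape of dependsOn is the point.

```python
import json

# Upstream activity: a Copy activity (typeProperties omitted for brevity).
copy_raw_logs = {
    "name": "CopyRawLogs",  # hypothetical name
    "type": "Copy",
    "dependsOn": [],
}

# Downstream activity: chained to CopyRawLogs through dependsOn.
clean_logs = {
    "name": "CleanLogs",  # hypothetical name
    "type": "DatabricksNotebook",
    "dependsOn": [
        {"activity": "CopyRawLogs", "dependencyConditions": ["Succeeded"]}
    ],
}

pipeline = {
    "name": "ChainedPipeline",  # hypothetical name
    "properties": {"activities": [copy_raw_logs, clean_logs]},
}

print(json.dumps(pipeline, indent=2))
```

Because CleanLogs lists CopyRawLogs in dependsOn, it is only considered for execution after the copy has finished.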

Branching activities

Azure Data Factory supports branching activities within a pipeline. An example is the If Condition activity, which is similar to an if statement in programming languages. It evaluates a condition: when the condition evaluates to true, one set of activities is executed, and when it evaluates to false, an alternative set of activities is executed.
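A minimal sketch of what an If Condition activity definition can look like, written as a Python dictionary. The activity names, the rowCount pipeline parameter, and the Wait activities used as stand-in branch bodies are all assumptions made for illustration.

```python
import json

def wait_activity(name, seconds):
    """Return a minimal Wait activity, used here only as a stand-in branch body."""
    return {
        "name": name,
        "type": "Wait",
        "typeProperties": {"waitTimeInSeconds": seconds},
    }

if_condition = {
    "name": "CheckRowCount",  # hypothetical name
    "type": "IfCondition",
    "typeProperties": {
        # The condition is an expression that must evaluate to true or false.
        "expression": {
            "value": "@greater(pipeline().parameters.rowCount, 0)",
            "type": "Expression",
        },
        "ifTrueActivities": [wait_activity("ProcessRows", 1)],
        "ifFalseActivities": [wait_activity("NothingToDo", 1)],
    },
}

print(json.dumps(if_condition, indent=2))
```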

Parameters

You can define parameters at the pipeline level and pass arguments while you’re invoking the pipeline on-demand or from a trigger. Activities then consume the arguments held in a parameter as they are passed to the pipeline.
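The sketch below shows one way to picture this: a pipeline definition (as a Python dictionary) that declares two parameters, an activity that consumes one of them through an expression, and a small local helper that mimics how arguments supplied at invocation override the declared defaults. The pipeline, parameter, and activity names are hypothetical, and resolve_arguments is an illustration only, not part of any Azure SDK.

```python
pipeline = {
    "name": "ParameterizedPipeline",  # hypothetical name
    "properties": {
        "parameters": {
            "delaySeconds": {"type": "Int", "defaultValue": 30},
            "sourceFolder": {"type": "String", "defaultValue": "landing/test"},
        },
        "activities": [
            {
                "name": "WaitBeforeLoad",
                "type": "Wait",
                "typeProperties": {
                    # The activity consumes the argument held in the parameter.
                    "waitTimeInSeconds": {
                        "value": "@pipeline().parameters.delaySeconds",
                        "type": "Expression",
                    }
                },
            }
        ],
    },
}

def resolve_arguments(pipeline, arguments):
    """Overlay invocation arguments on the declared parameter defaults."""
    declared = pipeline["properties"]["parameters"]
    unknown = set(arguments) - set(declared)
    if unknown:
        raise ValueError(f"undeclared parameters: {sorted(unknown)}")
    resolved = {name: spec.get("defaultValue") for name, spec in declared.items()}
    resolved.update(arguments)
    return resolved

print(resolve_arguments(pipeline, {"delaySeconds": 5}))
```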

Custom state passing

Azure Data Factory supports custom state passing: the output or state of an activity can be consumed by a subsequent activity in the pipeline. For example, in the JSON definition of an activity, you can access the output of a previous activity. Custom state passing lets you build workflows in which values pass from one activity to the next.
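As a hedged sketch, the following Python builds a downstream activity whose value is an expression that reads the output of an upstream Lookup activity. The activity names, the variable name, and the tableName column are assumptions, and the Lookup activity's typeProperties are omitted.

```python
def activity_output(activity_name, path):
    """Build an expression string that reads a previous activity's output."""
    return f"@activity('{activity_name}').output.{path}"

# Upstream activity (typeProperties omitted for brevity).
lookup_config = {"name": "LookupConfig", "type": "Lookup"}

# Downstream activity that consumes the first row returned by the lookup.
set_table_name = {
    "name": "SetTableName",
    "type": "SetVariable",
    "dependsOn": [
        {"activity": "LookupConfig", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "variableName": "tableName",  # assumes a pipeline variable of this name
        "value": {
            "value": activity_output("LookupConfig", "firstRow.tableName"),
            "type": "Expression",
        },
    },
}

print(set_table_name["typeProperties"]["value"]["value"])
```

The dependsOn entry matters here: the expression can only be evaluated once the upstream activity has produced its output.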

Looping containers

Looping containers, such as the ForEach activity, define repetition in a pipeline. The ForEach activity iterates over a collection and runs the specified activities in a loop, much like the for-each looping structure in programming languages. Besides the ForEach activity, there is also an Until activity, which is similar to a do-until loop: it runs a set of activities (do) in a loop until the condition (until) is met.
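To make the ForEach shape concrete, here is a minimal sketch of a ForEach activity definition as a Python dictionary. The tableNames pipeline parameter, the dataset name, and the inner Copy activity (with its typeProperties and outputs omitted) are assumptions for illustration.

```python
import json

for_each = {
    "name": "ForEachTable",  # hypothetical name
    "type": "ForEach",
    "typeProperties": {
        # The collection to iterate over, here an array-typed pipeline parameter.
        "items": {
            "value": "@pipeline().parameters.tableNames",
            "type": "Expression",
        },
        # Run iterations in parallel, at most four at a time.
        "isSequential": False,
        "batchCount": 4,
        "activities": [
            {
                "name": "CopyOneTable",
                "type": "Copy",
                "inputs": [
                    {
                        "referenceName": "SourceTableDataset",  # hypothetical
                        "type": "DatasetReference",
                        # @item() is the current element of the collection.
                        "parameters": {"tableName": "@item()"},
                    }
                ],
            }
        ],
    },
}

print(json.dumps(for_each, indent=2))
```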

Trigger-based flows

Pipelines can be triggered on demand (event-based, for example when a blob is posted) or by wall-clock time.

Invoke a pipeline from another pipeline

The Execute Pipeline activity allows one Data Factory pipeline to invoke another pipeline.
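A minimal sketch of an Execute Pipeline activity definition, again as a Python dictionary. The names RunChildPipeline and ChildPipeline and the sourceFolder parameter are hypothetical.

```python
import json

execute_child = {
    "name": "RunChildPipeline",  # hypothetical name
    "type": "ExecutePipeline",
    "typeProperties": {
        # Reference to the pipeline being invoked.
        "pipeline": {
            "referenceName": "ChildPipeline",  # hypothetical name
            "type": "PipelineReference",
        },
        # Arguments handed to the invoked pipeline's parameters.
        "parameters": {
            "sourceFolder": {
                "value": "@pipeline().parameters.sourceFolder",
                "type": "Expression",
            }
        },
        # Block the parent until the child run has finished.
        "waitOnCompletion": True,
    },
}

print(json.dumps(execute_child, indent=2))
```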

Delta flows

Delta flows are used for delta loads. In ETL patterns, a delta load only loads data that has changed since the previous run of a pipeline. Capabilities such as the Lookup activity and flexible scheduling help with handling delta load jobs. A Lookup activity reads or looks up a record or table name value from an external source, and succeeding activities can then reference its output.
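The idea behind a delta load can be sketched in plain Python, independent of Data Factory itself. The example assumes a watermark approach, where the last processed modification time is stored and looked up at the start of each run; the row layout and the delta_rows helper are hypothetical.

```python
from datetime import datetime

def delta_rows(rows, last_watermark):
    """Return the rows modified after last_watermark, plus the new watermark."""
    changed = [row for row in rows if row["modified"] > last_watermark]
    # If nothing changed, the watermark stays where it was.
    new_watermark = max((row["modified"] for row in changed), default=last_watermark)
    return changed, new_watermark

source = [
    {"id": 1, "modified": datetime(2024, 1, 1)},
    {"id": 2, "modified": datetime(2024, 1, 5)},
    {"id": 3, "modified": datetime(2024, 1, 9)},
]

# First run: the looked-up watermark is 3 January, so rows 2 and 3 are loaded.
changed, watermark = delta_rows(source, datetime(2024, 1, 3))
print([row["id"] for row in changed])  # [2, 3]

# Second run with the updated watermark: nothing new to load.
changed_again, _ = delta_rows(source, watermark)
print(changed_again)  # []
```

In a pipeline, the lookup of the stored watermark would typically be the job of a Lookup activity, with the filter applied in the source query of the copy step.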

Other control flows

There are many more control flow activities. See the following items for other useful activities:

  • Web activity: The Web activity can call a custom REST endpoint from a Data Factory pipeline. Datasets and linked services can be passed to the activity for it to consume.
  • Get metadata activity: The Get metadata activity retrieves the metadata of any data in Azure Data Factory.

Work with data factory pipelines

To work with data factory pipelines, it is imperative to understand what a pipeline in Azure Data Factory is.

A pipeline in Azure Data Factory represents a logical grouping of activities where the activities together perform a certain task.

An example of a combination of activities in one pipeline is ingesting and cleaning log data, followed by a mapping data flow that analyzes the cleaned log data.

A pipeline enables you to manage the separate individual activities as a set, which would otherwise be managed individually. It enables you to deploy and schedule the activities efficiently by using a single pipeline, versus managing each activity independently.

Activities in a pipeline are referred to as actions that you perform on your data. An activity can take zero or more input datasets and produce one or more output datasets.

An example of an action is a copy activity that copies data from Azure SQL Database to Azure Data Lake Storage Gen2. To build on this example, you can use a data flow activity or an Azure Databricks Notebook activity to process and transform the data that was copied to your Azure Data Lake Storage Gen2 account, so that the data is ready for business intelligence reporting solutions such as Azure Synapse Analytics.

Because many activities are possible in a pipeline in Azure Data Factory, they are grouped into three categories:

  • Data movement activities: The Copy Activity in Data Factory copies data from a source data store to a sink data store.
  • Data transformation activities: Azure Data Factory supports transformation activities such as Data Flow, Azure Function, Spark, and others that can be added to pipelines either individually or chained with another activity.
  • Control activities: Examples of control flow activities are ‘Get Metadata’, ‘For Each’, and ‘Execute Pipeline’.

Activities can depend on each other. An activity dependency defines how a subsequent activity depends on previous activities, and whether execution continues is determined by a dependency condition on the outcome of those previous activities. An activity that depends on one or more previous activities can have different dependency conditions.

The four dependency conditions are:

  • Succeeded
  • Failed
  • Skipped
  • Completed

For example, if a pipeline has an Activity A followed by an Activity B, and Activity B has the dependency condition ‘Succeeded’ on Activity A, then Activity B only runs if Activity A finishes with the status Succeeded.
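The four conditions can be modeled with a short Python function. This is a simplified sketch for a single upstream activity, assuming that Completed is satisfied whether the upstream activity succeeded or failed; dependency_met is a hypothetical helper, not a Data Factory API.

```python
def dependency_met(upstream_status, conditions):
    """Return True if a finished upstream activity satisfies any listed condition."""
    for condition in conditions:
        if condition == "Completed":
            # Completed does not care whether the upstream succeeded or failed.
            if upstream_status in ("Succeeded", "Failed"):
                return True
        elif condition == upstream_status:
            return True
    return False

# Activity B depends on Activity A with the condition Succeeded.
print(dependency_met("Succeeded", ["Succeeded"]))  # True
print(dependency_met("Failed", ["Succeeded"]))     # False

# A clean-up step that should run either way can use Completed instead.
print(dependency_met("Failed", ["Completed"]))     # True
```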

If you have multiple activities in a pipeline and subsequent activities are not dependent on previous activities, the activities may run in parallel.
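One way to see which activities are free to run side by side is to group them by their dependencies. The sketch below is a local approximation with hypothetical activity names: it groups activities into waves, whereas the service itself starts each activity as soon as its own dependencies are satisfied rather than waiting for a whole wave.

```python
def execution_waves(activities):
    """Group activity names into waves whose members do not depend on each other."""
    remaining = {
        activity["name"]: {dep["activity"] for dep in activity.get("dependsOn", [])}
        for activity in activities
    }
    done, waves = set(), []
    while remaining:
        # An activity is ready once everything it depends on has finished.
        ready = sorted(name for name, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("circular or unknown dependency")
        waves.append(ready)
        done.update(ready)
        for name in ready:
            del remaining[name]
    return waves

activities = [
    {"name": "CopyOrders"},
    {"name": "CopyCustomers"},
    {
        "name": "JoinData",
        "dependsOn": [
            {"activity": "CopyOrders", "dependencyConditions": ["Succeeded"]},
            {"activity": "CopyCustomers", "dependencyConditions": ["Succeeded"]},
        ],
    },
]

print(execution_waves(activities))
# [['CopyCustomers', 'CopyOrders'], ['JoinData']]
```

The two copy activities have no dependency on each other, so they land in the same wave and may run in parallel; JoinData has to wait for both.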

Debug data factory pipelines

Customer requirements and expectations for data integration are changing. As a result, users increasingly need to develop and debug their extract, transform, load (ETL) and extract, load, transform (ELT) workflows iteratively.

Azure Data Factory helps you build and debug pipelines iteratively as you develop your data integration solution. When you author a pipeline on the pipeline canvas, you can test your activities and pipelines by using the Debug capability.

In Azure Data Factory, you do not need to publish changes to the pipeline or its activities before you debug. This is helpful when you want to test changes and see whether they work as expected before you save and publish them.

Sometimes, you don’t want to debug the whole pipeline but test a part of the pipeline. A Debug run allows you to do just that. You can test the pipeline end to end or set a breakpoint. By doing so in debug mode, you can interactively see the results of each step while you build and debug your pipeline.

Debug and publish a pipeline

As you create or modify a pipeline that is running, you can see the results of each activity in the Output tab of the pipeline canvas.

After a test run succeeds and you are satisfied with the results, you can add more activities to the pipeline and continue debugging iteratively. If you are not satisfied, or want to stop the pipeline from debugging, you can cancel a test run while it is in progress. Be aware that selecting Debug actually runs the pipeline. Therefore, if the pipeline contains, for example, a copy activity, the test run will copy data from source to destination.

A best practice is to use test folders in your copy activities and other activities when debugging. Once you have debugged the pipeline and are satisfied with the results, switch to the actual folders for your normal operations.

To debug the pipeline, select Debug on the toolbar. You see the status of the pipeline run in the Output tab at the bottom of the window.


After the pipeline runs successfully, select Publish all in the top toolbar. This action publishes the entities (datasets and pipelines) you created to Data Factory.


Wait until you see the successfully published message. To see notification messages, select Show Notifications (the bell icon) at the top right of the portal.

Mapping data flow debug

During the building of Mapping Data Flows, you can interactively watch how the data shapes and transformations are executing so that you can debug them. To use this functionality, it is first necessary to turn on the “Data Flow Debug” feature.

The debug session can be used both in data flow design sessions and during pipeline debug execution of data flows. After debug mode is turned on, you build the data flow with an active Spark cluster, which shuts down once debug mode is turned off. You have a choice of which compute to use. Using an existing debug cluster reduces the start-up time. However, for complex or parallel workloads you might want to spin up your own just-in-time cluster.

Best practices for debugging data flows are to keep the debug mode on, and to check and validate the business logic included in the data flow. Visually viewing the data transformations and shapes helps you see the changes.

If you want to test the data flow in a pipeline that you’ve created, it is best to use the Debug button on the pipeline panel. While data preview doesn’t write data, a pipeline debug run of your data flow does write data to your sink destination, just like debugging any other pipeline.

Debug settings

Each debug session started from the Azure Data Factory user interface is considered a new session with its own Spark cluster. You can use the monitoring view for debug sessions to manage the debug sessions for each Data Factory that has been set up.

To see whether a Spark cluster is ready for debugging, you can check the cluster status indication at the top of the design surface. If it’s green, it’s ready. If the cluster wasn’t running when you entered debug mode, the waiting time could be around 5–7 minutes because the clusters need to spin up.

It is a best practice that after you finish debugging, you switch off the debug mode so that the Spark cluster terminates.

When you’re debugging, you can edit the data preview settings of a data flow by selecting Debug Settings. Examples include changing the row limit or, if you use source transformations, the file source. When you select the staging linked service, you can use Azure Synapse Analytics as a source.

If you have parameters in your Data Flow or any of its referenced datasets, you can specify what values to use during debugging by selecting the Parameters tab. During debugging, sinks are not required and are ignored in the dataflow. If you want to test and write the transformed data to your sink, you can execute the data flow from a pipeline, and use the debug execution from the pipeline.

As previously described, within Azure Data Factory, it is possible to only debug up to a certain point or an activity. To do so, you can use a breakpoint on the activity up to where you want to test, and then select Debug. A Debug Until option appears as an empty red circle at the upper right corner of the element. After you select the Debug Until option, it changes to a filled red circle to indicate the breakpoint is enabled. Azure Data Factory will then make sure that the test only runs until that breakpoint activity in the pipeline. This feature is useful when you want to test only a subset of the activities in a pipeline.

In most scenarios, the debug features in Azure Data Factory are sufficient. However, sometimes it is necessary to test changes to a pipeline in a cloned sandbox environment. A use case is when you have parameterized ETL pipelines and want to test how they behave when triggered by a file arrival versus a tumbling time window. In this case, cloning a sandbox environment might be more suitable.

It is also worth knowing that because Azure Data Factory is mostly charged by the number of runs, a second Data Factory doesn’t have to lead to additional charges.

Monitor debug runs

To monitor debug runs, you can check the output tab, but only for the most recent run that occurred in the browsing session, because it won’t show the history. If you would like to get a view of the history of debug runs, or see all the active debug runs, you can go to the Monitor tab.

Keep in mind that the Azure Data Factory service only keeps debug run history for 15 days. To monitor your data flow debug sessions, you also go to the Monitor tab.

Add parameters to data factory components

Parameterize linked services in Azure Data Factory

Within Azure Data Factory, you can parameterize a linked service and pass dynamic values to it at run time. A use case is connecting to several different databases on the same SQL Server instance, where you parameterize the database name in the linked service definition. The benefit is that you don’t have to create a separate linked service for each database on the same SQL Server.

It is also possible to parameterize other properties of the linked service like a username.

If you decide to parameterize linked services in Azure Data Factory, you can do this in the Azure Data Factory user interface, the Azure portal, or a programming interface of your preference.

If you choose to author the linked service through the user interface, Data Factory can provide you with built-in parameterization for some of the connectors:

  • Amazon Redshift
  • Azure Cosmos DB (SQL API)
  • Azure Database for MySQL
  • Azure SQL Database
  • Azure Synapse Analytics (formerly SQL DW)
  • MySQL
  • Oracle
  • SQL Server
  • Generic HTTP
  • Generic REST

If you go to the creation/edit blade of the linked service, you will find the options for parameterizing.


If you cannot use the built-in parameterization because you’re using a different type of connector, you are able to edit the JSON through the user interface.

In the linked service creation/edit pane, expand Advanced at the bottom of the pane, select the Specify dynamic contents in JSON format checkbox, and specify the linked service JSON payload.
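As a hedged sketch of such a payload, the Python below builds a linked service definition in which the database name is a parameter. The linked service name, server name, and DBName parameter are hypothetical, the connection string is shortened and carries no credentials, and preview_connection_string is a local helper that only imitates the substitution for illustration; it is not how the service evaluates expressions.

```python
import re

linked_service = {
    "name": "AzureSqlDatabaseParameterized",  # hypothetical name
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {"DBName": {"type": "String"}},
        "typeProperties": {
            # The database name is resolved from the DBName parameter at run time.
            "connectionString": (
                "Server=tcp:example-server.database.windows.net,1433;"
                "Database=@{linkedService().DBName};"
                "Encrypt=True;Connection Timeout=30"
            )
        },
    },
}

def preview_connection_string(service, arguments):
    """Locally substitute linked service parameters, for illustration only."""
    declared = service["properties"]["parameters"]
    template = service["properties"]["typeProperties"]["connectionString"]

    def substitute(match):
        name = match.group(1)
        if name not in declared:
            raise KeyError(f"parameter {name} is not declared")
        return str(arguments[name])

    return re.sub(r"@\{linkedService\(\)\.(\w+)\}", substitute, template)

# One definition serves several databases on the same server.
for database in ("SalesDb", "InventoryDb"):
    print(preview_connection_string(linked_service, {"DBName": database}))
```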


Or, after you create a linked service without parameterization, in the Management hub, select Linked services, and find the specific linked service. Then, select {} (Code button) to edit the JSON.

Global parameters in Azure Data Factory

Global parameters are constants defined at the data factory level that you can consume in any pipeline expression. A use case is when multiple pipelines share identical parameter names and values.

If you use a continuous integration and continuous deployment (CI/CD) process with Azure Data Factory, the global parameters can be overridden, if you want, for each environment that you have created.

Create global parameters in Azure Data Factory

To create a global parameter, go to the Global parameters tab in the Manage section. Select New to open the creation side menu pane.

In the side menu pane, enter a name, select a data type, and specify the value of your parameter.

After a global parameter is created, you can edit it by selecting the parameter’s name. To alter multiple parameters together, select Edit all.

Use global parameters in a pipeline

When you use global parameters in a pipeline, they are mostly referenced in pipeline expressions. For example, if a pipeline references a resource such as a dataset or data flow, you can pass the global parameter value down through that resource’s parameters. A global parameter is referenced as pipeline().globalParameters followed by a period and the parameter name.
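A minimal sketch of passing a global parameter down to a dataset, written in Python. The global parameter name landingFolder, the dataset name, and the folderPath dataset parameter are assumptions, and global_parameter_ref is a hypothetical helper that only builds the expression string.

```python
def global_parameter_ref(name):
    """Build a pipeline expression that reads a global parameter."""
    return f"@pipeline().globalParameters.{name}"

# Dataset reference inside an activity; the dataset is assumed to declare
# a folderPath parameter of its own.
dataset_reference = {
    "referenceName": "LandingZoneDataset",  # hypothetical name
    "type": "DatasetReference",
    "parameters": {"folderPath": global_parameter_ref("landingFolder")},
}

print(dataset_reference["parameters"]["folderPath"])
# @pipeline().globalParameters.landingFolder
```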

Global parameters in CI/CD

When you use global parameters with CI/CD in Azure Data Factory, there are two ways to integrate them:

  • Include global parameters in the Azure Resource Manager template
  • Deploy global parameters via a PowerShell script

In most CI/CD practices, it’s beneficial to include global parameters in the Azure Resource Manager template. This is recommended because it integrates natively with CI/CD: global parameters are added as Azure Resource Manager template parameters, which suits values that change from one environment to the next.

To enable global parameters in an Azure Resource Manager template, go to the Management hub. Be aware that after you add global parameters to an Azure Resource Manager template, it adds an Azure Data Factory level setting, which can override other settings like git configs.

Deploying global parameters through a PowerShell script is useful when the factory-level settings described above are enabled in an elevated environment, such as UAT or PROD, and you don’t want them to be overridden.

Parameterize mapping dataflows

Within Azure Data Factory, you can use mapping data flows, which enable you to use parameters. If you set parameters inside a data flow definition, you can use the parameters in expressions. The parameter values will be set by the calling pipeline through the Execute Data Flow activity.

There are three options for setting the values in the data flow activity expressions:

  • Use the pipeline control flow expression language to set a dynamic value.
  • Use the data flow expression language to set a dynamic value.
  • Use either expression language to set a static literal value.

The reason for parameterizing mapping data flows is to make sure that your data flows are generalized, flexible, and reusable.

Create parameters in dataflow

To add parameters to your data flow, select the blank portion of the data flow canvas to see the general properties.

In the Settings pane, you will see a Parameter tab.

Select New to generate a new parameter. For each parameter, you must assign a name, select a type, and optionally set a default value.

Assign parameters from a pipeline in mapping dataflow

If you have created a data flow in which you have set parameters, it’s possible to execute it from a pipeline using the Execute Data Flow Activity.

After you have added the activity to the pipeline canvas, you’ll find the data flow parameters in the activity’s Parameters tab.

When assigning parameter values, you can use either the pipeline expression language or the data flow expression language, which is based on Spark types. You can also combine the two, that is, pipeline and data flow expression parameters.


This article is from the free online

Introduction to Data Engineering with Microsoft Azure 1

Created by
FutureLearn - Learning For Life
