
Monday, February 18, 2019

Continuous integration and delivery (CI/CD) in Azure Data Factory using DevOps and GitHub

(2019-Feb-18) With Azure Data Factory (ADF) continuous integration, you help your team to collaborate and develop data transformation solutions within the same data factory workspace and maintain your combined development efforts in a central code repository. Continuous delivery helps to build and deploy your ADF solution for testing and release purposes. Basically, the CI/CD process helps to establish a good software development practice and aims to build a healthy relationship between development, quality assurance, and other supporting teams.

Back in my SQL Server Integration Services (SSIS) heavy development times, SSIS solution deployment was a combination of building .ispac files and running a set of PowerShell scripts to deploy those files to SSIS servers and update the required environment variables if necessary.

Microsoft has introduced a DevOps culture for continuous software integration and delivery - https://docs.microsoft.com/en-us/azure/devops/learn/what-is-devops - and I really like their opening line, "DevOps is the union of people, process, and products to enable continuous delivery of value to our end users"!

Recently, they've added DevOps support for Azure Data Factory as well - https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment - and those are the very steps I tried to replicate in this blog post.





My ADF pipeline use-case to deploy

So, here is my use case: I have a pipeline in my Azure Data Factory synced to GitHub (my Development workspace), and I want to be able to test it in a separate environment and then deploy it to a Production environment.
Development > Testing > Production

In my ADF I have a template pipeline solution "Copy multiple files containers between File Stores" that I used to copy files/folders from one blob container to another.



In my Azure development resource group, I created three artifacts (which may vary depending on your own development needs):
- Key Vault, to store secret values
- Storage Account, to keep my testing files in blob containers
- Data Factory, the very thing I tend to test and deploy


Creating continuous integration and delivery (CI/CD) pipeline for my ADF

Step 1: Integrating your ADF pipeline code to source control (GitHub)
In my previous blog post (Azure Data Factory integration with GitHub) I had already shown a way to sync your code to GitHub. For the continuous integration (CI) of your ADF pipeline, you will need to make sure that, along with your development (or feature) branch, you also publish your master branch from your Azure Data Factory UI. This will create an additional adf_publish branch in your GitHub code repository.
development > master > adf_publish

As Gaurav Malhotra (Senior Product Manager at Microsoft) has commented about this code branch, "The adf_publish branch is automatically created by ADF when you publish from your collaboration branch (usually master) which is different from adf_publish branch... The moment a check-in happens in this branch after you publish from your factory, you can trigger the entire VSTS release to do CI/CD"

Publishing my ADF code from master to the adf_publish branch automatically created two files:
Template file (ARMTemplateForFactory.json): a template containing all the data factory metadata (pipelines, datasets, etc.) corresponding to my data factory.

Configuration file (ARMTemplateParametersForFactory.json): contains the parameters that will be different for each environment (Development, Testing, Production, etc.).
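
If you are curious about which values you will later need to override per environment, a quick PowerShell sketch like the one below lists the generated parameter names and their current (development) values. This is just a convenience check on my side, assuming the adf_publish branch is cloned locally and the factory folder is named azu-eus2-dev-datafactory:

# A minimal sketch: read the generated ARM parameters file and list its parameters.
# The local path is an assumption based on my repository layout - adjust it for yours.
$paramFile = ".\azu-eus2-dev-datafactory\ARMTemplateParametersForFactory.json"
$params = (Get-Content -Path $paramFile -Raw | ConvertFrom-Json).parameters
$params.PSObject.Properties | ForEach-Object { "{0} = {1}" -f $_.Name, $_.Value.value }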




Step 2: Creating a deployment pipeline in Azure DevOps

2.a Access to DevOps: If you don't have a DevOps account, you can start for free - https://azure.microsoft.com/en-ca/services/devops/ - or ask your DevOps team to provide you with access to your organization's build/release environment.

2.b Create a new DevOps project: by clicking the [+ Create project] button you start creating your build/release solution.

2.c Create a new release pipeline
In a normal software development cycle, I would create a build pipeline in my Azure DevOps project, then archive the built files and use those deployment files for further release operations. As the Microsoft team indicated on their CI/CD documentation page for Azure Data Factory, the Release pipeline is based directly on the source control repository, so the Build pipeline is omitted; however, it's a matter of choice and you can always change this in your solution.

Select [Pipelines] > [Release] and then click the [New pipeline] button in order to create a new release pipeline for your ADF solution.



2.d Create Variable groups for your solution
This step is optional, as you can add variables directly to your Release pipelines and individual releases. But for the sake of reusing some generic values, I created three variable groups in my DevOps project, each containing the same (Environment) variable with the corresponding value ("dev", "tst", "prd").
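
As a side note, if you prefer scripting this step over clicking through the UI, the Azure DevOps CLI extension can create the same variable groups. A rough sketch, assuming the azure-devops extension for az is installed and you are signed in; the organization, project and group names below are placeholders of my own choosing:

# Rough sketch: create one variable group per environment, each with a single Environment variable.
az devops configure --defaults "organization=https://dev.azure.com/<my-organization>" "project=<my-project>"
foreach ($envName in "dev", "tst", "prd") {
    az pipelines variable-group create --name "adf-$envName-variables" --variables "Environment=$envName"
}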


2.e Add an artifact to your release pipeline
In the Artifact section, I set the following attributes:
- Service connection: my GitHub account
- GitHub repository for my Azure Data Factory
- Default branch (adf_publish) as a source for my release processing
- Alias (Development) as it is the beginning of everything
 


2.f Add a new Stage to your Release pipeline
I click the [+ Add a stage] button, select the "Empty Job" template and name the stage "Testing". Then I add two tasks to its agent job from the list of tasks:
- Azure Key Vault, to read Azure secret values and pass them to my ADF ARM template parameters
- Azure Resource Group Deployment, to deploy my ADF ARM template


In my real-job ADF projects, DevOps Build/Release pipelines are more sophisticated and managed by our DevOps and Azure Admin teams. Those pipelines may contain multiple PowerShell and other Azure deployment steps. However, for the purpose of this blog post, I'm only concerned with deploying my data factory to other environments.


For the (Azure Key Vault) task, I set the "Key vault" name to this value: azu-caes-$(Environment)-keyvault, which will resolve to the Key Vault of each of the three environments depending on the value of the $(Environment) group variable {dev, tst, prd}.
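
In plain PowerShell terms, what this task does is roughly the following; a minimal sketch, assuming the Az.KeyVault module and a signed-in Azure context (the -AsPlainText switch requires a recent module version):

# A minimal sketch of the Azure Key Vault task's job: read a secret from the
# environment-specific vault so it can later be passed to the ARM template deployment.
$Environment = "tst"   # stands in for the value resolved from the linked variable group at release time
$vaultName = "azu-caes-$Environment-keyvault"
$blobStorageConnectionString = Get-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-ls-connectionString" -AsPlainText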


For the (Azure Resource Group Deployment) task, I set the following attributes (a rough PowerShell equivalent is sketched after this list):
- Action: Create or update resource group
- Resource group: azu-caes-$(Environment)-rg
- Template location: Linked artifact
- Template: in my case the linked ARM template file location is
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateForFactory.json
- Template parameters: in my case the linked ARM template parameters file location is 
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateParametersForFactory.json
- Override template parameters: some of my ADF pipeline parameters I either leave blank or with their default values; however, I specifically override the following parameters:
   + factoryName: $(factoryName)
   + blob_storage_ls_connectionString: $(blob-storage-ls-connectionString)
   + AzureKeyVault_ls_properties_typeProperties_baseUrl: $(AzureKeyVault_ls_properties_typeProperties_baseUrl)
- Deployment mode: Incremental
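
And here is the rough PowerShell equivalent I promised above; a simplified sketch that passes only the overridden parameters (everything else falls back to the template defaults), assuming the Az.Resources module, a signed-in Azure context, and that the ARM template file sits in the current folder. The $Environment, $factoryName, $blobStorageConnectionString and $keyVaultBaseUrl variables stand in for the values the release resolves at run time:

# A rough sketch of the Azure Resource Group Deployment task configured above.
$overrides = @{
    factoryName                                        = $factoryName
    blob_storage_ls_connectionString                   = $blobStorageConnectionString
    AzureKeyVault_ls_properties_typeProperties_baseUrl = $keyVaultBaseUrl
}
New-AzResourceGroupDeployment `
    -ResourceGroupName "azu-caes-$Environment-rg" `
    -TemplateFile ".\ARMTemplateForFactory.json" `
    -TemplateParameterObject $overrides `
    -Mode Incremental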


2.g Add Release Pipeline variables:
To support the variable references in my ARM template deployment task, I add the following variables within the "Release" scope:
factoryName
AzureKeyVault_ls_properties_typeProperties_baseUrl

2.h Clone Testing stage to a new Production stage:
By clicking the "Clone" button in the "Testing" stage I create a new release stage and name it "Production". I keep both tasks within this new stage unchanged; the $(Environment) variable should do the whole trick of switching between the two stages.

2.i Link Variable groups to Release stages:
A final step in creating my Azure Data Factory release pipeline is to link my Variable groups to the corresponding release stages. I go to the Variables section in my Release pipeline, select Variable groups, and by clicking the [Link variable group] button I link each variable group to its stage, so that both my Testing and Production sections look the following way:


After saving my ADF Release pipeline, I keep my fingers crossed :-) It's time to test it!

Step 3: Create and test a new release
I click the [+ Create a release] button and start monitoring my ADF release process.



After it had successfully finished, I went to check my Production Data Factory (azu-eus2-prd-datafactory) and was able to find my deployed pipeline, which originated from the development environment. I then successfully tested this Production ADF pipeline by executing it.


One thing to note: all data connections to the Production Key Vault and Blob Storage account had been properly deployed by the ADF release pipeline (no manual intervention).
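
If you prefer to double-check the result from the command line rather than the ADF UI, a small sketch like this one lists what actually landed in the production factory (Az.DataFactory module; the resource names mirror my production setup):

# A minimal sketch to list the deployed linked services and pipelines.
$rg  = "azu-caes-prd-rg"
$adf = "azu-eus2-prd-datafactory"
Get-AzDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $adf | Select-Object Name
Get-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $adf | Select-Object Name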

Step 4: Test CI/CD process for my Azure Data Factory
Now it's time to test a complete cycle of committing a new data factory code change and deploying it to both the Testing and Production environments.

1) I add a new "Wait 5 seconds" activity task to my ADF pipeline and save it to the development branch of the (azu-eus2-dev-datafactory) development data factory instance.
2) I create a pull request and merge this change to the master branch of my GitHub repository.
3) Then I publish my new ADF code change from master to the adf_publish branch.
4) And this automatically triggers my ADF Release pipeline to deploy this code change to the Testing and then Production environments.


Don't forget to enable the "Continuous deployment trigger" on your release pipeline Artifact and select the 'adf_publish' code branch as your branch filter. Otherwise, commits to any of your GitHub branches might trigger the release/deployment process.




5) And after the new release has successfully finished, I can check that the new "Wait" activity task appears in the Production data factory pipeline:

Summary:
1) I feel a bit exhausted after writing such a long blog post. I hope you will find it helpful in your own DevOps pipelines for your Azure Data Factory solutions.
2) After seeing a complete CI/CD cycle of my Azure Data Factory where the code was changed, integrated into the main source code repository and automatically deployed, I can confidently say that my data factory has been successfully released! :-)

A link to my GitHub repository with the ADF solution from this blog post can be found here: https://github.com/NrgFly/Azure-DataFactory-CICD

And happy data adventure! 

25 comments:

  1. You've used the environment parameter in the Resource Group. But what if there are 2 Data Factories in the same Resource Group? And only the name is different, like xx-dev-xx or xx-uat-xx. Your variables are related to the Resource Group name.

    1. Thanks for your comment. Yes, Azure objects for different environments should stay in different resource groups; otherwise you don't need to set the $(Environment) variable, and a hard-coded resource group name would work, which I don't recommend.

      I created 3 different resource groups: azu-caes-dev-rg, azu-caes-tst-rg, azu-caes-prd-rg.
      And within each group I created 3 Azure resource objects:
      - Key Vault, azu-caes-$(Environment)-keyvault
      - Storage Account, azucaes$(Environment)storageaccount
      - Data Factory, azu-eus2-$(Environment)-datafactory

      9 objects in total for my blog post use-case. Your case could be different.

    2. @Unknown, as Rayis mentioned, using resource groups and subscriptions is a common way to delineate environments (subscriptions especially). Resource groups are for related objects that have a similar lifecycle and should be deployed/updated together.

  2. Rayis, with more than one developer, or several projects with many DF pipelines and DF datasets using one data factory, what is the best way to separate projects and allow some assets to be developed while others are moved through a DevOps pipeline to release?

    1. Hi @Ozhug, treat your Data Factory as one complex SSIS project, where each of your ADF pipelines could be considered a set of SSIS packages. For team development in Data Factory, each developer could create a feature branch in Git source control, and after finishing the development work for a particular feature, their code changes could be merged into a development or master (release) branch, which will then be a candidate for release and deployment using DevOps.

  3. We have ADF running in our DEV and higher environments. The DEV environment is backed by a Git repo and releases are automated via Azure DevOps / ARM templates.

    Now, we are cleaning up some objects which are not in use anymore. Several pipelines have to be deleted. But we're getting failures when we try to publish. The failure has a description like: "The document cannot be deleted since it is referenced by....". In some cases, this is not true. In fact, the so-called referencing objects don't even exist anymore. The ARM files have no entry with the given name...

    Does this sound familiar to anybody? Does anyone know how to fix this?

    1. Yes, it does sound familiar. My coworker (Patrick) helped me to understand this, so I will give him full credit.
      Since ARM deploys in incremental mode, it does not clean up objects that have been deleted. This script deals with that:
      https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment#sample-script-to-stop-and-restart-triggers-and-clean-up
      Patrick made improvements to that script.
      You just need to add this script as a PowerShell step before deploying your ARM template.
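      For illustration only, the core idea of that pre-deployment step is to stop the factory's active triggers before the ARM template is applied; a minimal sketch with placeholder resource names (the full Microsoft sample also removes deleted objects, which this sketch does not do):

      # Minimal sketch: stop all started triggers in the target factory before deploying the ARM template.
      $rg  = "azu-caes-tst-rg"
      $adf = "azu-eus2-tst-datafactory"
      Get-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf |
          Where-Object { $_.RuntimeState -eq "Started" } |
          ForEach-Object { Stop-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $_.Name -Force }
      # ...deploy the ARM template, then bring the triggers back with Start-AzDataFactoryV2Trigger.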

    2. Hi Rayis, we are already using this, and it works for deleting the objects. But nevertheless, publishing has a certain check that detects a reference that does not exist anymore.

    3. This comment has been removed by the author.

    4. My previous comment was not entirely correct after really fixing the issue. The issue was caused by old pipelines in ADF that were created pre-Git. There were references to existing pipelines, but these were not in the Git environment, so they weren't cleaned up by Git. After deleting them by hand in our ADF dev environment, the issue was fixed. And for releasing to DAP we use the PowerShell script provided by Microsoft with some adjustments.

  4. Would you please elaborate on Agent job / Agent pool setting?

    1. Yes, Microsoft did a good job introducing time schedule triggers for ADF pipelines, similar to Agent jobs for SSIS packages:
      https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-schedule-trigger

    2. Thanks for your quick reply. But this is not what I meant. When I create a release pipeline, the first thing I need to configure is an "Agent job", and you have to select one of several options (Default (Default), Hosted (Hosted), Hosted macOS (Hosted macOS), Hosted VS2017 (Hosted VS2017), etc.).
      I want to know what the role of this is and, in terms of deploying ADF, what the best option is and how to configure it.

    3. My apologies, I misunderstood your question. I will admit that before creating DevOps pipelines in Azure I knew nothing about agents and agent pools. During our training with Microsoft, they instructed us that for initial use of build and release pipelines the Hosted VS2017 pool is sufficient; it's a shared environment that hosts the resources and logs related to DevOps pipeline processing. During testing with more people on our team, we discovered that Hosted VS2017 may become slow, so a dedicated VM would help. At my company, we have a DevOps team who manages those agent pool VMs, and in case of a detailed investigation, we may request them to provide us with additional logging information after each release pipeline run. I hope this helps.
      Otherwise, the Microsoft documentation sites are very informative as well: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/pools-queues?view=azure-devops

  5. Hello Rayis,

    I'm stuck at step 2.g because I don't have any Key Vault for ADF.
    It's nice work. If you made a video of this complete process and uploaded it, that would make things a little bit more clear for us.

    Could you please help me with this to complete my task?

    Regards,
    Rakesh Dontha
    rakeshdontha34@gmail.com

  6. Can you suggest what variables should be created in Key Vault for ADF for Dev and Test? I didn't use Key Vault in my ADF because these ADFs are old.

    1. If you don't have a Key Vault, it won't hurt to create one. It's a good and secure place to keep connection strings and other access-related items that you can securely reference when building data connectors in your ADF pipelines.
      So for my blog post, I created two secrets:
      1) blob-storage-key: to connect to blob storage from my ADF pipeline
      2) blob-storage-ls-connectionString: which I use in my release pipeline and pass to the template parameter of the same name in my deployment ARM template. You can see them all in my adf_publish GitHub branch:
      https://github.com/NrgFly/Azure-DataFactory-CICD/blob/adf_publish/azu-eus2-dev-datafactory/ARMTemplateForFactory.json
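      If it helps, creating those two secrets from PowerShell is a one-liner each; a minimal sketch, where the vault name and the secret values are placeholders:

      # Minimal sketch, assuming the Az.KeyVault module; replace the placeholder values with your own.
      $vaultName = "azu-caes-dev-keyvault"
      Set-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-key" `
          -SecretValue (ConvertTo-SecureString "<storage-account-key>" -AsPlainText -Force)
      Set-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-ls-connectionString" `
          -SecretValue (ConvertTo-SecureString "<storage-connection-string>" -AsPlainText -Force)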

      Let me know if this helps you.

    2. Thanks Rayis,

      Thanks for your active response. What keys should be created for Azure SQL DB? Is one SQL connection string enough to execute my process?

    3. Try a connection string or a password and decide which one you will use.

    4. I have created the secret for the connection string, but I'm unable to connect to the Linked Service in the ADF connection. I'm getting an access denied error while testing the connection string. Could you please help me out?

      ERROR: Failed to get the secret from key vault, secretName: ****, secretVersion: , vaultBaseUrl: https://****.vault.azure.net/. The error message is: Access denied

      What changes do I need to make, and where?

    5. You need to grant ADF access to your Key Vault. Please read the steps on this documentation page:
      https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault
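      The PowerShell version of that grant is short; a minimal sketch with placeholder names for your own data factory and vault:

      # Minimal sketch: give the data factory's managed identity permission to read the vault's secrets.
      $adf = Get-AzDataFactoryV2 -ResourceGroupName "azu-caes-dev-rg" -Name "azu-eus2-dev-datafactory"
      Set-AzKeyVaultAccessPolicy -VaultName "azu-caes-dev-keyvault" -ObjectId $adf.Identity.PrincipalId -PermissionsToSecrets Get,List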

    6. Hello Rayis,

      I'm past that error now; I'm now running into a new error.
      While I'm running the release, I'm getting the following MSI error:

      Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine.
      Could you please help me with this?

    7. It's hard to tell without knowing at which step it is failing. Did you grant your VSTS pipeline access to your Key Vault?
      Please check if you followed these steps:
      https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/continuous-integration-deployment.md#grant-permissions-to-the-azure-pipelines-agent
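      In PowerShell terms, that permission grant looks roughly like the sketch below; the application ID is a placeholder you can copy from the service connection details in DevOps:

      # Minimal sketch: allow the service principal behind the DevOps service connection to read the vault's secrets.
      Set-AzKeyVaultAccessPolicy -VaultName "azu-caes-tst-keyvault" `
          -ServicePrincipalName "<service-connection-application-id>" `
          -PermissionsToSecrets Get,List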

  7. Granting access from ADF to the Key Vault is indeed the answer.
