
Monday, February 18, 2019

Continuous integration and delivery (CI/CD) in Azure Data Factory using DevOps and GitHub

(2019-Feb-18) With Azure Data Factory (ADF) continuous integration, you help your team to collaborate and develop data transformation solutions within the same data factory workspace and maintain your combined development efforts in a central code repository. Continuous delivery helps to build and deploy your ADF solution for testing and release purposes. Basically, the CI/CD process helps to establish a good software development practice and aims to build a healthy relationship between development, quality assurance, and other supporting teams.

Back in my SQL Server Integration Services (SSIS) heavy development times, SSIS solution deployment was a combination of building .ispac files and running a set of PowerShell scripts to deploy those files to SSIS servers and update the required environment variables if necessary.

Microsoft has introduced a DevOps culture for continuous software integration and delivery - https://docs.microsoft.com/en-us/azure/devops/learn/what-is-devops - and I really like their opening line, "DevOps is the union of people, process, and products to enable continuous delivery of value to our end users"!

Recently, they've added DevOps support for Azure Data Factory as well - https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment - and those are the very steps I tried to replicate in this blog post.





My ADF pipeline use-case to deploy

So, here is my use case: I have a pipeline in my Azure Data Factory synced to GitHub (my Development workspace), and I want to be able to test it in a separate environment and then deploy it to a Production environment.
Development > Testing > Production

In my ADF I have a template pipeline solution "Copy multiple files containers between File Stores" that I used to copy files/folders from one blob container to another.



In my Azure development resource group, I created three artifacts (which may vary depending on your own development needs):
- Key Vault, to store secret values
- Storage Account, to keep my testing files in blob containers
- Data Factory, the very thing I tend to test and deploy


Creating continuous integration and delivery (CI/CD) pipeline for my ADF

Step 1: Integrating your ADF pipeline code to source control (GitHub)
In my previous blog post (Azure Data Factory integration with GitHub) I had already shown a way to sync your code to GitHub. For the continuous integration (CI) of your ADF pipeline, you will need to make sure that, along with your development (or feature) branch, you also publish your master branch from your Azure Data Factory UI. This will create an additional adf_publish branch in your GitHub code repository.
development > master > adf_publish

As Gaurav Malhotra (Senior Product Manager at Microsoft) has commented about this code branch, "The adf_publish branch is automatically created by ADF when you publish from your collaboration branch (usually master) which is different from adf_publish branch... The moment a check-in happens in this branch after you publish from your factory, you can trigger the entire VSTS release to do CI/CD"

Publishing my ADF code from master to the adf_publish branch automatically created two files:
Template file (ARMTemplateForFactory.json): a template containing all the data factory metadata (pipelines, datasets, etc.) corresponding to my data factory.

Configuration file (ARMTemplateParametersForFactory.json): contains the parameters that will be different for each environment (Development, Testing, Production, etc.).
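
If you are curious about which values you will later need to override per environment, a quick PowerShell sketch like the one below lists the generated parameter names and their current (development) values. This is just a convenience check on my side, assuming the adf_publish branch is cloned locally and the factory folder is named azu-eus2-dev-datafactory:

# A minimal sketch: read the generated ARM parameters file and list its parameters.
# The local path is an assumption based on my repository layout - adjust it for yours.
$paramFile = ".\azu-eus2-dev-datafactory\ARMTemplateParametersForFactory.json"
$params = (Get-Content -Path $paramFile -Raw | ConvertFrom-Json).parameters
$params.PSObject.Properties | ForEach-Object { "{0} = {1}" -f $_.Name, $_.Value.value }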




Step 2: Creating a deployment pipeline in Azure DevOps

2.a Access to DevOps: If you don't have a DevOps account, you can start for free - https://azure.microsoft.com/en-ca/services/devops/ - or ask your DevOps team to provide you with access to your organization's build/release environment.

2.b Create a new DevOps project: by clicking the [+ Create project] button you start creating your build/release solution.

2.c Create a new release pipeline
In a normal software development cycle, I would create a build pipeline in my Azure DevOps project, then archive the built files and use those deployment files for further release operations. As the Microsoft team indicated on their CI/CD documentation page for Azure Data Factory, the Release pipeline is based directly on the source control repository, so the Build pipeline is omitted; however, it's a matter of choice and you can always change this in your solution.

Select [Pipelines] > [Release] and then click the [New pipeline] button in order to create a new release pipeline for your ADF solution.



2.d Create Variable groups for your solution
This step is optional, as you can add variables directly to your Release pipelines and individual releases. But for the sake of reusing some generic values, I created three variable groups in my DevOps project, each containing the same (Environment) variable with the corresponding value ("dev", "tst", "prd").
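
As a side note, if you prefer scripting this step over clicking through the UI, the Azure DevOps CLI extension can create the same variable groups. A rough sketch, assuming the azure-devops extension for az is installed and you are signed in; the organization, project and group names below are placeholders of my own choosing:

# Rough sketch: create one variable group per environment, each with a single Environment variable.
az devops configure --defaults "organization=https://dev.azure.com/<my-organization>" "project=<my-project>"
foreach ($envName in "dev", "tst", "prd") {
    az pipelines variable-group create --name "adf-$envName-variables" --variables "Environment=$envName"
}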


2.e Add an artifact to your release pipeline
In the Artifact section, I set the following attributes:
- Service connection: my GitHub account
- GitHub repository for my Azure Data Factory
- Default branch (adf_publish) as a source for my release processing
- Alias (Development) as it is the beginning of everything
 


2.f Add a new Stage to your Release pipeline
I click the [+ Add a stage] button, select the "Empty Job" template and name the stage "Testing". Then I add two tasks to its agent job from the list of tasks:
- Azure Key Vault, to read Azure secret values and pass them to my ADF ARM template parameters
- Azure Resource Group Deployment, to deploy my ADF ARM template


In my real-job ADF projects, DevOps Build/Release pipelines are more sophisticated and managed by our DevOps and Azure Admin teams. Those pipelines may contain multiple PowerShell and other Azure deployment steps. However, for the purpose of this blog post, I'm only concerned with deploying my data factory to other environments.


For the (Azure Key Vault) task, I set the "Key vault" name to this value: azu-caes-$(Environment)-keyvault, which will resolve to the Key Vault of each of the three environments depending on the value of the $(Environment) group variable {dev, tst, prd}.
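
In plain PowerShell terms, what this task does is roughly the following; a minimal sketch, assuming the Az.KeyVault module and a signed-in Azure context (the -AsPlainText switch requires a recent module version):

# A minimal sketch of the Azure Key Vault task's job: read a secret from the
# environment-specific vault so it can later be passed to the ARM template deployment.
$Environment = "tst"   # stands in for the value resolved from the linked variable group at release time
$vaultName = "azu-caes-$Environment-keyvault"
$blobStorageConnectionString = Get-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-ls-connectionString" -AsPlainText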


For the (Azure Resource Group Deployment) task, I set the following attributes (a rough PowerShell equivalent is sketched after this list):
- Action: Create or update resource group
- Resource group: azu-caes-$(Environment)-rg
- Template location: Linked artifact
- Template: in my case the linked ARM template file location is
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateForFactory.json
- Template parameters: in my case the linked ARM template parameters file location is 
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateParametersForFactory.json
- Override template parameters: some of my ADF pipeline parameters I either leave blank or with their default values; however, I specifically override the following parameters:
   + factoryName: $(factoryName)
   + blob_storage_ls_connectionString: $(blob-storage-ls-connectionString)
   + AzureKeyVault_ls_properties_typeProperties_baseUrl: $(AzureKeyVault_ls_properties_typeProperties_baseUrl)
- Deployment mode: Incremental
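
And here is the rough PowerShell equivalent I promised above; a simplified sketch that passes only the overridden parameters (everything else falls back to the template defaults), assuming the Az.Resources module, a signed-in Azure context, and that the ARM template file sits in the current folder. The $Environment, $factoryName, $blobStorageConnectionString and $keyVaultBaseUrl variables stand in for the values the release resolves at run time:

# A rough sketch of the Azure Resource Group Deployment task configured above.
$overrides = @{
    factoryName                                        = $factoryName
    blob_storage_ls_connectionString                   = $blobStorageConnectionString
    AzureKeyVault_ls_properties_typeProperties_baseUrl = $keyVaultBaseUrl
}
New-AzResourceGroupDeployment `
    -ResourceGroupName "azu-caes-$Environment-rg" `
    -TemplateFile ".\ARMTemplateForFactory.json" `
    -TemplateParameterObject $overrides `
    -Mode Incremental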


2.g Add Release Pipeline variables:
To support the variable references in my ARM template deployment task, I add the following variables within the "Release" scope:
factoryName
AzureKeyVault_ls_properties_typeProperties_baseUrl

2.h Clone Testing stage to a new Production stage:
By clicking the "Clone" button in the "Testing" stage I create a new release stage and name it "Production". I keep both tasks within this new stage unchanged; the $(Environment) variable should do the whole trick of switching between the two stages.

2.i Link Variable groups to Release stages:
A final step in creating my Azure Data Factory release pipeline is to link my Variable groups to the corresponding release stages. I go to the Variables section in my Release pipeline, select Variable groups, and by clicking the [Link variable group] button I link each variable group to its stage, so that both my Testing and Production sections look the following way:


After saving my ADF Release pipeline, I keep my fingers crossed :-) It's time to test it!

Step 3: Create and test a new release
I click the [+ Create a release] button and start monitoring my ADF release process.



After it had successfully finished, I went to check my Production Data Factory (azu-eus2-prd-datafactory) and was able to find my deployed pipeline, which originated from the development environment. I then successfully tested this Production ADF pipeline by executing it.


One thing to note: all data connections to the Production Key Vault and Blob Storage account had been properly deployed by the ADF release pipeline (no manual intervention).
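
If you prefer to double-check the result from the command line rather than the ADF UI, a small sketch like this one lists what actually landed in the production factory (Az.DataFactory module; the resource names mirror my production setup):

# A minimal sketch to list the deployed linked services and pipelines.
$rg  = "azu-caes-prd-rg"
$adf = "azu-eus2-prd-datafactory"
Get-AzDataFactoryV2LinkedService -ResourceGroupName $rg -DataFactoryName $adf | Select-Object Name
Get-AzDataFactoryV2Pipeline -ResourceGroupName $rg -DataFactoryName $adf | Select-Object Name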

Step 4: Test CI/CD process for my Azure Data Factory
Now it's time to test a complete cycle of committing a new data factory code change and deploying it to both the Testing and Production environments.

1) I add a new "Wait 5 seconds" activity task to my ADF pipeline and save it to the development branch of the (azu-eus2-dev-datafactory) development data factory instance.
2) I create a pull request and merge this change to the master branch of my GitHub repository.
3) Then I publish my new ADF code change from master to the adf_publish branch.
4) And this automatically triggers my ADF Release pipeline to deploy this code change to the Testing and then Production environments.


Don't forget to enable the "Continuous deployment trigger" on your release pipeline Artifact and select the 'adf_publish' code branch as your branch filter. Otherwise, commits to any of your GitHub branches might trigger the release/deployment process.




5) And after the new release has successfully finished, I can check that the new "Wait" activity task appears in the Production data factory pipeline:

Summary:
1) I feel a bit exhausted after writing such a long blog post. I hope you will find it helpful in your own DevOps pipelines for your Azure Data Factory solutions.
2) After seeing a complete CI/CD cycle of my Azure Data Factory where the code was changed, integrated into the main source code repository and automatically deployed, I can confidently say that my data factory has been successfully released! :-)

A link to my GitHub repository with the ADF solution from this blog post can be found here: https://github.com/NrgFly/Azure-DataFactory-CICD

And happy data adventure! 

25 comments:

  1. You've used the environment parameter in the Resource Group. But what if there are 2 Data Factories in the same Resource Group? And only the name is different, like xx-dev-xx or xx-uat-xx. Your variables are related to the Resource Group name.

    1. Thanks for your comment. Yes, Azure objects for different environments should stay in different resource groups; otherwise you don't need to set the $(Environment) variable, and a hard-coded resource group name would work, which I don't recommend.

      I created 3 different resource groups: azu-caes-dev-rg, azu-caes-tst-rg, azu-caes-prd-rg.
      And within each group I created 3 Azure resource objects:
      - Key Vault, azu-caes-$(Environment)-keyvault
      - Storage Account, azucaes$(Environment)storageaccount
      - Data Factory, azu-eus2-$(Environment)-datafactory

      9 objects in total for my blog post use-case. Your case could be different.

    2. @Unknown, as Rayis mentioned, using resource groups and subscriptions is a common way to delineate environments (subscriptions especially). Resource groups are for related objects that have a similar lifecycle and should be deployed/updated together.

  2. Rayis, with more than one developer, or several projects with many DF pipelines and DF datasets using one data factory, what is the best way to separate projects and allow some assets to be developed while others are moved through a DevOps pipeline to release?

    1. Hi @Ozhug, treat your Data Factory as one complex SSIS project, where each of your ADF pipelines could be considered a set of SSIS packages. For team development in Data Factory, each developer could create a feature branch in Git source control, and after finishing the development work for a particular feature, their code changes could be merged into a development or master (release) branch, which will then be a candidate for release and deployment using DevOps.

  3. We have ADF running in our DEV and higher environments. The DEV environment is backed by a Git repo and releases are automated via Azure DevOps / ARM templates.

    Now, we are cleaning up some objects which are not in use anymore. Several pipelines have to be deleted. But we're getting failures when we try to publish. The failure has a description like: "The document cannot be deleted since it is referenced by....". In some cases, this is not true. In fact, the so-called referencing objects don't even exist anymore. The ARM files have no entry with the given name...

    Does this sound familiar to anybody? Does anyone know how to fix this?

    1. Yes, it does sound familiar. My coworker (Patrick) helped me to understand this, so I will give him full credit.
      Since ARM deploys in incremental mode, it does not clean up objects that have been deleted. This script deals with that:
      https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment#sample-script-to-stop-and-restart-triggers-and-clean-up
      Patrick made improvements to that script.
      You just need to add this script as a PowerShell step before deploying your ARM template.
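      For illustration only, the core idea of that pre-deployment step is to stop the factory's active triggers before the ARM template is applied; a minimal sketch with placeholder resource names (the full Microsoft sample also removes deleted objects, which this sketch does not do):

      # Minimal sketch: stop all started triggers in the target factory before deploying the ARM template.
      $rg  = "azu-caes-tst-rg"
      $adf = "azu-eus2-tst-datafactory"
      Get-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf |
          Where-Object { $_.RuntimeState -eq "Started" } |
          ForEach-Object { Stop-AzDataFactoryV2Trigger -ResourceGroupName $rg -DataFactoryName $adf -Name $_.Name -Force }
      # ...deploy the ARM template, then bring the triggers back with Start-AzDataFactoryV2Trigger.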

    2. Hi Rayis, we are already using this, and it works for deleting the objects. But nevertheless, publishing has a certain check that detects a reference that does not exist anymore.

    3. This comment has been removed by the author.

    4. My previous comment was not entirely correct after really fixing the issue. The issue was caused by old pipelines in ADF that were created pre-Git. There were references to existing pipelines, but these were not in the Git environment, so they weren't cleaned up by Git. After deleting them by hand in our ADF dev environment, the issue was fixed. And for releasing to DAP we use the PowerShell script provided by Microsoft with some adjustments.

  4. Would you please elaborate on Agent job / Agent pool setting?

    1. Yes, Microsoft did a good job introducing time schedule triggers for ADF pipelines, similar to Agent jobs for SSIS packages:
      https://docs.microsoft.com/en-us/azure/data-factory/how-to-create-schedule-trigger

    2. Thanks for your quick reply. But this is not what I meant. When I create a release pipeline, the first thing I need to configure is an "Agent job", and you have to select one of several options (Default (Default), Hosted (Hosted), Hosted macOS (Hosted macOS), Hosted VS2017 (Hosted VS2017), etc.).
      I want to know what the role of this is and, in terms of deploying ADF, what the best option is and how to configure it.

    3. My apologies, I misunderstood your question. I will admit that before creating DevOps pipelines in Azure I knew nothing about agents and agent pools. During our training with Microsoft, they instructed us that for initial use of build and release pipelines the Hosted VS2017 pool is sufficient; it's a shared environment that hosts the resources and logs related to DevOps pipeline processing. During testing with more people on our team, we discovered that Hosted VS2017 may become slow, so a dedicated VM would help. At my company, we have a DevOps team who manages those agent pool VMs, and in case of a detailed investigation, we may request them to provide us with additional logging information after each release pipeline run. I hope this helps.
      Otherwise, the Microsoft documentation sites are very informative as well: https://docs.microsoft.com/en-us/azure/devops/pipelines/agents/pools-queues?view=azure-devops

  5. Hello Rayis,

    I'm stuck at step 2.g because I don't have any Key Vault for ADF.
    It's nice work. If you made a video of this complete process and uploaded it, that would make things a little bit more clear for us.

    Could you please help me with this to complete my task?

    Regards,
    Rakesh Dontha
    rakeshdontha34@gmail.com

  6. Can you suggest what variables should be created in Key Vault for ADF for Dev and Test? I didn't use Key Vault in my ADF because these ADFs are old.

    1. If you don't have a Key Vault, it won't hurt to create one. It's a good and secure place to keep connection strings and other access-related items that you can securely reference when building data connectors in your ADF pipelines.
      So for my blog post, I created two secrets:
      1) blob-storage-key: to connect to blob storage from my ADF pipeline
      2) blob-storage-ls-connectionString: which I use in my release pipeline and pass to the template parameter of the same name in my deployment ARM template. You can see them all in my adf_publish GitHub branch:
      https://github.com/NrgFly/Azure-DataFactory-CICD/blob/adf_publish/azu-eus2-dev-datafactory/ARMTemplateForFactory.json
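      If it helps, creating those two secrets from PowerShell is a one-liner each; a minimal sketch, where the vault name and the secret values are placeholders:

      # Minimal sketch, assuming the Az.KeyVault module; replace the placeholder values with your own.
      $vaultName = "azu-caes-dev-keyvault"
      Set-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-key" `
          -SecretValue (ConvertTo-SecureString "<storage-account-key>" -AsPlainText -Force)
      Set-AzKeyVaultSecret -VaultName $vaultName -Name "blob-storage-ls-connectionString" `
          -SecretValue (ConvertTo-SecureString "<storage-connection-string>" -AsPlainText -Force)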

      Let me know if this helps you.

    2. Thanks Rayis,

      Thanks for your active response. What keys should be created for Azure SQL DB? Is one SQL connection string enough to execute my process?

    3. Try a connection string or a password and decide which one you will use.

    4. I have created the secret for the connection string, but I'm unable to connect to the Linked Service in the ADF connection. I'm getting an access denied error while testing the connection string. Could you please help me out?

      ERROR: Failed to get the secret from key vault, secretName: ****, secretVersion: , vaultBaseUrl: https://****.vault.azure.net/. The error message is: Access denied

      What changes do I need to make, and where?

    5. You need to grant ADF access to your Key Vault. Please read the steps on this documentation page:
      https://docs.microsoft.com/en-us/azure/data-factory/store-credentials-in-key-vault
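      The PowerShell version of that grant is short; a minimal sketch with placeholder names for your own data factory and vault:

      # Minimal sketch: give the data factory's managed identity permission to read the vault's secrets.
      $adf = Get-AzDataFactoryV2 -ResourceGroupName "azu-caes-dev-rg" -Name "azu-eus2-dev-datafactory"
      Set-AzKeyVaultAccessPolicy -VaultName "azu-caes-dev-keyvault" -ObjectId $adf.Identity.PrincipalId -PermissionsToSecrets Get,List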

    6. Hello Rayis,

      I'm past that error now; I'm now running into a new error.
      While I'm running the release, I'm getting the following MSI error:

      Could not fetch access token for Managed Service Principal. Please configure Managed Service Identity (MSI) for virtual machine.
      Could you please help me with this?

    7. It's hard to tell without knowing at which step it is failing. Did you grant your VSTS pipeline access to your Key Vault?
      Please check if you followed these steps:
      https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-factory/continuous-integration-deployment.md#grant-permissions-to-the-azure-pipelines-agent
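      In PowerShell terms, that permission grant looks roughly like the sketch below; the application ID is a placeholder you can copy from the service connection details in DevOps:

      # Minimal sketch: allow the service principal behind the DevOps service connection to read the vault's secrets.
      Set-AzKeyVaultAccessPolicy -VaultName "azu-caes-tst-keyvault" `
          -ServicePrincipalName "<service-connection-application-id>" `
          -PermissionsToSecrets Get,List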

  7. Granting access from ADF to the Key Vault is indeed the answer.
