
Monday, February 18, 2019

Continuous integration and delivery (CI/CD) in Azure Data Factory using DevOps and GitHub

(2019-Feb-18) With Azure Data Factory (ADF) continuous integration, you help your team to collaborate and develop data transformation solutions within the same data factory workspace and maintain your combined development efforts in a central code repository. Continuous delivery helps to build and deploy your ADF solution for testing and release purposes. Basically, the CI/CD process helps to establish a good software development practice and aims to build a healthy relationship between development, quality assurance, and other supporting teams.

Back in my SQL Server Integration Services (SSIS) heavy development times, SSIS solution deployment was a combination of building ispac files and running a set of PowerShell scripts to deploy those files to SSIS servers and update the required environment variables if necessary.

Microsoft has been promoting the DevOps culture for continuous software integration and delivery - https://docs.microsoft.com/en-us/azure/devops/learn/what-is-devops - and I really like their opening line, "DevOps is the union of people, process, and products to enable continuous delivery of value to our end users"!

Recently, they've added DevOps support for Azure Data Factory as well - https://docs.microsoft.com/en-us/azure/data-factory/continuous-integration-deployment - and those are the very steps I tried to replicate in this blog post.





My ADF pipeline use-case to deploy

So, here is my use case: I have a pipeline in my Azure Data Factory synced with GitHub (my Development workspace), and I want to be able to test it in a separate environment and then deploy it to a Production environment.
Development > Testing > Production

In my ADF I have a template pipeline solution, "Copy multiple files containers between File Stores", that I used to copy files/folders from one blob container to another.



In my Azure development resource group, I created three artifacts (which may vary depending on your own development needs):
- Key Vault, to store secret values
- Storage Account, to keep my testing files in blob containers
- Data Factory, the very thing I intend to test and deploy


Creating continuous integration and delivery (CI/CD) pipeline for my ADF

Step 1: Integrating your ADF pipeline code to source control (GitHub)
In my previous blog post (Azure Data Factory integration with GitHub) I had already shown a way to sync your code to GitHub. For the continuous integration (CI) of your ADF pipeline, you will need to make sure that, along with your development (or feature) branch, you also publish your master branch from your Azure Data Factory UI. This will create an additional adf_publish branch of your code repository in GitHub.
development > master > adf_publish

As Gaurav Malhotra (Senior Product Manager at Microsoft) has commented about this code branch, "The adf_publish branch is automatically created by ADF when you publish from your collaboration branch (usually master) which is different from adf_publish branch... The moment a check-in happens in this branch after you publish from your factory, you can trigger the entire VSTS release to do CI/CD"

Publishing my ADF code from master to the adf_publish branch automatically created two files:
- Template file (ARMTemplateForFactory.json): the template containing all the data factory metadata (pipelines, datasets, etc.) corresponding to my data factory.
- Configuration file (ARMTemplateParametersForFactory.json): contains the environment parameters that will be different for each environment (Development, Testing, Production, etc.).
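
If you're curious about what ends up in that parameters file, a quick way to peek at it is to load it with PowerShell; a minimal sketch, assuming you have the adf_publish branch cloned locally:

    # Load the ARM parameters file that ADF publishes to the adf_publish branch
    $paramFile = ".\azu-eus2-dev-datafactory\ARMTemplateParametersForFactory.json"
    $params    = Get-Content -Path $paramFile -Raw | ConvertFrom-Json

    # List every parameter name with its current (development) value
    $params.parameters.PSObject.Properties |
        ForEach-Object { "{0} = {1}" -f $_.Name, $_.Value.value }

These are exactly the values that the release pipeline will override per environment later on.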




Step 2: Creating a deployment pipeline in Azure DevOps

2.a Access to DevOps: If you don't have a DevOps account, you can start for free - https://azure.microsoft.com/en-ca/services/devops/ - or ask your DevOps team to provide you with access to your organization's build/release environment.

2.b Create new DevOps project: by clicking [+ Create project] button you start creating your build/release solution.

2.c Create a new release pipeline
In a normal software development cycle, I would create a build pipeline in my Azure DevOps project, then archive the built files and use those deployment files for further release operations. As the Microsoft team indicated in their CI/CD documentation page for Azure Data Factory, the Release pipeline is based on the source control repository, thus the Build pipeline is omitted; however, it's a matter of choice and you can always change this in your solution.

Select [Pipelines] > [Release] and then click the [New pipeline] button in order to create a new release pipeline for your ADF solution.



2.d Create Variable groups for your solution
This step is optional, as you can add variables directly to your Release pipelines and individual releases. But for the sake of reusing some generic values, I created three variable groups in my DevOps project, each containing the same (Environment) variable with the corresponding value ("dev", "tst", "prd").


2.e Add an artifact to your release pipeline
In the Artifact section, I set the following attributes:
- Service connection: my GitHub account
- GitHub repository: the repository for my Azure Data Factory
- Default branch: adf_publish, as the source for my release processing
- Alias: Development, as it is the beginning of everything
 


2.f Add new Stage to your Release pipeline
I click [+ Add a stage] button, select the "Empty Job" template and name it "Testing". Then I add two tasks from the task list to the agent job:
- Azure Key Vault, to read Azure secret values and pass them to my ADF ARM template parameters
- Azure Resource Group Deployment, to deploy my ADF ARM template


In my real-life ADF projects, DevOps Build/Release pipelines are more sophisticated and are managed by our DevOps and Azure Admin teams. Those pipelines may contain multiple PowerShell and other Azure deployment steps. However, for the purpose of this blog post, I'm only concerned with deploying my data factory to other environments.


For the (Azure Key Vault) task, I set the "Key vault" name to azu-caes-$(Environment)-keyvault, which will resolve to each environment's Key Vault depending on the value of the $(Environment) group variable {dev, tst, prd}.
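
For context, here is a rough sketch of what this task does behind the scenes; a minimal PowerShell example only, assuming the AzureRM module and that the secret carries the same name as my blob-storage-ls-connectionString parameter:

    # Resolve the environment-specific Key Vault name (dev | tst | prd)
    $environment  = "tst"
    $keyVaultName = "azu-caes-$environment-keyvault"

    # Read the secret and surface it as a secret pipeline variable, the way the DevOps task does
    $secret = Get-AzureRmKeyVaultSecret -VaultName $keyVaultName -Name "blob-storage-ls-connectionString"
    Write-Output "##vso[task.setvariable variable=blob-storage-ls-connectionString;issecret=true]$($secret.SecretValueText)"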


For the (Azure Resource Group Deployment) task, I set the following attributes (a rough PowerShell equivalent of this step is sketched right after this list):
- Action: Create or update resource group
- Resource group: azu-caes-$(Environment)-rg
- Template location: Linked artifact
- Template: in my case, the linked ARM template file location is
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateForFactory.json
- Template parameters: in my case, the linked ARM template parameters file location is
  $(System.DefaultWorkingDirectory)/Development/azu-eus2-dev-datafactory/ARMTemplateParametersForFactory.json
- Override template parameters: I leave some of my ADF pipeline parameters either blank or with their default values; however, I specifically override the following ones:
   + factoryName: $(factoryName)
   + blob_storage_ls_connectionString: $(blob-storage-ls-connectionString)
   + AzureKeyVault_ls_properties_typeProperties_baseUrl: $(AzureKeyVault_ls_properties_typeProperties_baseUrl)
- Deployment mode: Incremental
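
And here is that rough PowerShell equivalent; a sketch only, assuming the AzureRM module - the testing factory name and the Key Vault URL are my assumptions based on the naming convention above, and the connection-string value is a placeholder:

    # Environment-specific names built with the same $(Environment) convention
    $environment   = "tst"
    $resourceGroup = "azu-caes-$environment-rg"

    # Value that the Azure Key Vault task would have fetched in the previous step (placeholder here)
    $blobConnectionString = ConvertTo-SecureString "<connection string from Key Vault>" -AsPlainText -Force

    # Incremental ARM deployment of the published ADF template, overriding selected parameters
    New-AzureRmResourceGroupDeployment `
        -ResourceGroupName     $resourceGroup `
        -Mode                  Incremental `
        -TemplateFile          ".\azu-eus2-dev-datafactory\ARMTemplateForFactory.json" `
        -TemplateParameterFile ".\azu-eus2-dev-datafactory\ARMTemplateParametersForFactory.json" `
        -factoryName           "azu-eus2-$environment-datafactory" `
        -blob_storage_ls_connectionString $blobConnectionString `
        -AzureKeyVault_ls_properties_typeProperties_baseUrl "https://azu-caes-$environment-keyvault.vault.azure.net/"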


2.g Add Release Pipeline variables:
To support the variable references in my ARM template deployment task, I add the following variables within the "Release" scope:
- factoryName
- AzureKeyVault_ls_properties_typeProperties_baseUrl

2.h Clone Testing stage to a new Production stage:
By clicking the "Clone" button in the "Testing" stage I create a new release stage and name it as "Production". I will keep both tasks within this new stage unchanged, where the $(Environment) variable should do the whole trick of switching between two stages.

2.i Link Variable groups to Release stages:
A final step in creating my Azure Data Factory release pipeline is to link my Variable groups to the corresponding release stages. I go to the Variables section of my Release pipeline and select Variable groups; by clicking the [Link variable group] button, I link a variable group to each stage so that both my Testing and Production sections look the following way:


After saving my ADF Release pipeline, I keep my fingers crossed :-) It's time to test it!

Step 3: Create and test a new release
I click [+ Create a release] button and start monitoring my ADF release process. 



After it had successfully finished, I went to check my Production Data Factory (azu-eus2-prd-datafactory) and was able to find the deployed pipeline, which had originated from the development environment. And I successfully tested this Production ADF pipeline by executing it.


One thing to note: all data connections to the Production Key Vault and Blob Storage account were properly deployed by the ADF release pipeline (no manual intervention required).

Step 4: Test CI/CD process for my Azure Data Factory
Now it's time to test a complete cycle of committing a new data factory code change and deploying it to both the Testing and Production environments.

1) I add a new "Wait 5 seconds" activity task to my ADF pipeline and save it to my development branch of the (azu-eus2-dev-datafactory) development data factory instance.
2) I create a pull request and merge this change into the master branch of my GitHub repository.
3) Then I publish my new ADF code change from master to the adf_publish branch.
4) And this automatically triggers my ADF Release pipeline to deploy this code change to the Testing and then Production environments.


Don't forget to enable the "Continuous deployment trigger" on your release pipeline Artifact and select the final 'adf_publish' code branch as your filter. Otherwise, a commit to any of your GitHub branches might trigger the release/deployment process.




5) And after the new release is successfully finished, I can check that the new "Wait" activity task appears in the Production data factory pipeline:

Summary:
1) I feel a bit exhausted after writing such a long blog post. I hope you will find it helpful in your own DevOps pipelines for your Azure Data Factory solutions.
2) After seeing a complete CI/CD cycle of my Azure Data Factory where the code was changed, integrated into the main source code repository and automatically deployed, I can confidently say that my data factory has been successfully released! :-)

A link to my GitHub repository with the ADF solution from this blog post can be found here: https://github.com/NrgFly/Azure-DataFactory-CICD

And happy data adventure! 

Monday, February 11, 2019

Creating custom solution templates in Azure Data Factory

(2019-Feb-11) Azure Data Factory (ADF) provides you with a framework for creating data transformation solutions in the Microsoft cloud environment. The recently introduced Template Gallery for ADF pipelines can speed up this development process and provide you with helpful guidance for creating additional activity tasks in your pipelines.

We naturally wonder whether something standard can be further adjusted. That custom design is like ordering a regular pizza and then hitting the "customize" button in order to add a few toppings of our choice. It would be very convenient then to save this customized "creation" for future orders. And Azure Data Factory has a similar option to save your custom data transformation solutions (pipelines) as templates and reuse them later.



When you open your Data Factory UI, the option to see and create your own templates is available to you. Live factory mode doesn't allow you to add custom solution templates to the gallery; this "Templates" feature is only available in Azure Data Factory Git-integrated mode.

 


Here is how I was able to test this:

Step 1: Saving pipeline as a template

I've modified a pipeline that was originally created from the "Copy multiple files containers between File Stores" ADF template and added my custom Web Email Notification activity task at the end of the workflow. I have also made some additional changes to the Name and Description of this new "Copy Files with Email Notification" pipeline.


Then I can hit the "Save as template" button on the Pipeline tab to add this existing solution to the ADF gallery. A new window will show up with an overview of all the activity tasks and the Git location for the new template. In my case, this template was saved to my personal GitHub.




Step 2: Verifying your custom solution in the Template Gallery

After successfully saving your ADF pipeline to the Template Gallery, its name will appear under Templates on the left-hand side of the Azure Data Factory UI. If you need to change this template, you can either make those code changes directly in the template's JSON files in GitHub or drop it and recreate it from the existing pipeline again.


I was also able to see my custom ADF pipeline solution "Copy File with Email Notification" in the Template Gallery. It grabs your attention to see your own name alongside Microsoft in the list of template contributors, even if your creation only exists within your own Data Factory workspace.





Step 3: Using your custom solution template

You can either hit the ellipsis button beside Pipelines and select "Pipeline from template" in order to open the Template gallery, or go right to "Templates" and choose your new custom pipeline solution. A window will open up offering you to set linked services; in the case of your own custom solution template, those options may vary.



Difference between a custom solution ADF template and a cloned pipeline:
You can always clone an existing pipeline in ADF and treat it as a template for your future development; however, there are some differences between this approach and creating a pipeline from the Template gallery.
1) Your custom solution template will be saved in a separate "/templates" Git location, and it will also be coupled with a manifest.json file.
2) A cloned pipeline will preserve all the original settings of its parent pipeline, vs. the interactive UI experience of selecting/creating additional settings (e.g. linked services) when you add a new pipeline from a template.

Personally, I really liked this option of saving my custom ADF pipeline solution as a template. It makes me think that you could now create a whole new data factory as a template :-)

Official Microsoft documentation for solution templates in Azure Data Factory: 
https://docs.microsoft.com/en-us/azure/data-factory/solution-templates-introduction 

My custom "Copy File with Email Notification" solution template in my personal GitHub: 
https://github.com/NrgFly/Azure-DataFactory/tree/master/Samples/templates

Friday, February 8, 2019

Developing pipelines in Azure Data Factory using Template gallery

(2019-Feb-08) I often see demos of Azure Data Factory (ADF) development techniques with a modest set of activity tasks and not very complex logic for a data transformation process. Knowing more about ADF's capabilities may generate additional interest in learning it and starting data transformation projects on this platform; additionally, Mapping Data Flows in ADF will help to boost its adoption level when it becomes generally available.

A good friend of mine, Simon Nuss, reminded me this morning that ADF had been enriched with an additional feature that helps you create pipelines using existing templates from the Template gallery. All you need to do is click the ellipsis beside Pipelines in order to select this option:


Currently, there are 13 templates that you can start using right away:
- Bulk Copy from Database
- Copy data from Amazon S3 to Azure Data Lake Store
- Copy data from Google BigQuery to Azure Data Lake Store
- Copy data from HDFS to Azure Data Lake Store
- Copy data from Netezza to Azure Data Lake Store
- Copy data from on-premises SQL Server to SQL Azure
- Copy data from on-premises SQL Server to SQL Data Warehouse
- Copy data from Oracle to SQL Data Warehouse
- Copy multiple files containers between File Stores
- Delta copy from Database
- ETL with Azure Databricks
- Schedule Azure-SSIS Integration Runtime to execute SSIS package
- Transform data using on-demand HDInsight

And I would expect that Microsoft will be adding more new templates to this list.

Test case:

To test this out, I decided to use the "Copy multiple files containers between File Stores" template with a use case of copying source data files from one blob container to a staging blob container.

I already have a blob container storesales with several CSV files and I want to automate the copying process to a new container storesales-staging that I have just created in my existing blob storage account:


Step 1: Selecting an ADF pipeline template from the gallery

After choosing the "Copy multiple files containers between File Stores" template, a window pops up where I can set linked services for both source and sink file stores.


After finishing this, a new pipeline is created in my ADF that has two activity tasks: a "Get Metadata" activity and a "ForEach" container. There is also a link to the official Microsoft documentation website for Azure Data Factory, which is only available during the initial working session with this new pipeline; the next time you open it, this link will no longer be there.




Step 2: Parameters setting

This particular pipeline template already has two parameters, which I set to the source and destination file paths of my blob storage, "/storesales" and "/storesales-staging" respectively:



Step 3: Testing my new ADF pipeline

Further along the way, I test my new pipeline in Debug mode, and it gets successfully executed:


And I can also see that all my CSV files were copied into my destination blob storage container. One of the test cases using ADF pipeline templates is successfully finished!
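
As an extra sanity check outside of the ADF monitoring view, the destination container can also be listed with PowerShell; a small sketch assuming the classic Azure.Storage module, with placeholder storage account credentials:

    # Placeholder storage account details - replace with your own
    $storageAccount = "<storage account name>"
    $storageKey     = "<storage account key>"

    # List the blobs in the staging container to confirm the CSV files arrived
    $context = New-AzureStorageContext -StorageAccountName $storageAccount -StorageAccountKey $storageKey
    Get-AzureStorageBlob -Container "storesales-staging" -Context $context |
        Select-Object Name, Length, LastModified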


Summary:
1) Gallery templates in Azure Data Factory are a great way to start building your data transfer/transformation workflows.
2) More new templates will become available, I hope.
3) And you can further adjust pipelines based on those templates with your custom logic. 

Let's explore and use ADF pipelines from its Template gallery more! 
And let's wait for Mapping Data flows general availability, it will become even more interesting! :-)

Wednesday, February 6, 2019

Azure Data Factory integration with GitHub

(2019-Feb-06) Working with Azure Data Factory (ADF) enables me to build and monitor my Extract Transform Load (ETL) workflows in Azure. My ADF pipelines are a cloud version of the ETL projects I previously built in SQL Server Integration Services (SSIS).

And prior to this point, all my sample ADF pipelines were developed in so-called "Live Data Factory Mode" using my personal workspace, i.e. all changes had to be published in order to be saved. This hasn't been the best practice from my side, and I needed to start using a source control tool to preserve and version my development code.

Back in August of 2018, Microsoft introduced GitHub integration for Azure Data Factory objects - https://azure.microsoft.com/en-us/blog/azure-data-factory-visual-tools-now-supports-github-integration/ - which was a great improvement from a team development perspective.


So now is the day to put all my ADF pipeline samples into my personal GitHub repository.
Each of my previous blog posts:
1) Setting Variables in Azure Data Factory Pipelines
2) Append Variable activity in Azure Data Factory: Story of combining things together
3) System Variables in Azure Data Factory: Your Everyday Toolbox
4) Email Notifications in Azure Data Factory: Failure is not an option 

has a corresponding pipeline created in my Azure Data Factory:


And all of them are now publicly available in this GitHub repository:
https://github.com/NrgFly/Azure-DataFactory

Let me show you how I did this using my personal GitHub account; you can do this with enterprise GitHub accounts as well.

Step 1: Set up Code Repository
A) Open your existing Azure Data Factory and select the "Set up Code Repository" option from the top left "Data Factory" menu:


B) then choose "GitHub" as your Repository Type:


C) and make sure you authenticate your GitHub repository with the Azure Data Factory itself: 


Step 2: Saving your content to GitHub


After selecting an appropriate GitHub code repository for your ADF artifacts and pressing the [Save] button:


You can validate them all in GitHub itself. Source code integration allowed me to save all my ADF artifacts: pipelines, datasets, linked services, and triggers.


And that's where I can see all my four ADF pipelines:



Step 3: Testing your further changes in ADF pipelines

Knowing that all my ADF objects are now stored in GitHub, let's see whether a code change in Azure Data Factory will be synchronized there.

I add a new description to my pipeline with Email notifications:


After saving this change in ADF I can see how it's being synchronized in my GitHub repository:


Summary:
1) GitHub integration with Azure Data Factory is possible.
2) And now I'm a bit closer to automating my deployment process and using Azure DevOps (VSTS) to create my CI/CD pipelines! :)

Monday, January 28, 2019

Can I add a custom reference layer to ArcGIS and use it in Power BI?

(2019-Jan-28) When you work with maps using the ArcGIS visual in Power BI, you always have a feeling that it is a tool within another tool. On the surface level, you have options to set data attributes for geo coordinates, coloring and time controls. However, when you go into the Edit mode of ArcGIS, the possibilities to adjust your map visualization expand to setting base maps, location types, map themes, symbol styles, and pins, as well as infographics and reference layers.




Reference layers are additional shape/geo objects that you can add in ArcGIS for Power BI to enhance your data story with more contextual elements related to your existing maps.
Here is an extract from the official Esri documentation:
"When you add a reference layer to the map, you're providing context for the data you're already displaying. Reference layers can include demographic data, such as household income, age, or education. They can also include publicly shared feature layers available on ArcGIS Online that provide more information about areas surrounding the locations on your map. For example, if your data layer shows the location of fast-food restaurants, you could add a reference layer that shows the proximity of high schools and universities. Reference layers allow you to dig deeper into your data to provide a greater picture of your business information".
https://doc.arcgis.com/en/maps-for-powerbi/design/add-a-reference-layer.htm



How can I add my own custom layers (shapes) to ArcGIS and use them further down in my Power BI report? This was a point of interest for me and a result of questions from other people! My further attempts to explore this very topic I owe to this blog post: https://dataveld.com/2016/10/02/how-to-add-your-own-arcgis-reference-layer-for-power-bi/ written by David Eldersveld, where he shares very detailed steps for creating custom reference layers that can later be found in Power BI:
  Step 1 – Sign in to ArcGIS Online
  Step 2 – Choose the source file from your computer
  Step 3 – Share your feature layer
  Step 4 – Search your reference layer in Power BI

In my new case, I wanted to test out my own geo shapes that I had already created using the QGIS application (I blogged about this already: http://datanrg.blogspot.com/2019/01/creating-my-own-map-shapefiles-for.html). So, can I transform my Giza Pyramids shapes into ArcGIS reference layers and find them in Power BI?

Before you sign yourself into ArcGIS Online, there are a few things to decide, starting with what type of account you can use there.

ArcGIS Public Account:
- ArcGIS Public Account is a personal account with limited usage and capabilities and is meant for non-commercial use only.
- You can still create feature layers and maps and further share them publicly.
- Your shared feature layers will be stored in the public Feature Collection.

ArcGIS Organizational Account:
- As a member of an organization, you will have access to the organization's geospatial content that you can use to create maps. You can also share your work with other members of your organization, participate in groups, and save your work.
- You can create hosted feature layers and maps and further share them publicly.
- Your hosted shared feature layers will be stored as Feature Services.

And here is a very important thing: the only way for Power BI to see your created feature layers is when they are created in your organizational ArcGIS workspace and shared as hosted feature layers. Public ArcGIS account access won't provide you with this functionality.

There is a trial of ArcGIS Online that gives you an organizational account for 21 days: https://www.esri.com/en-us/arcgis/trial. Your content will be lost after the 21-day period, however.

So, following David Eldersveld's initial set of steps:
  Step 1 – I've accessed ArcGIS and applied for their organizational account trial.

  Step 2 – I've created a new geo item in my ArcGIS workspace and selected a zip file with my shapefiles of Giza Pyramids that I had previously created:

I provided a title and tags for this new item, and I also selected the checkbox to make this feature a hosted layer.

  Step 3 – By clicking [Share] button I made it available and searchable to ArcGIS map in Power BI:



  Step 4 – Search your reference layer in Power BI
I've added and publicly shared another pentagon-shaped layer that I manually created in QGIS before. Both feature layers looked this way in my ArcGIS workspace content:



And this was the culminating moment for me: I was finally able to locate




And use my publicly available reference layers in Power BI:



It is always a rewarding feeling when a quest to validate something unknown results in a successful outcome. However, it's too bad that sharing custom feature layers in ArcGIS Online using a personal access account doesn't allow you to publish your shared layers to the Esri hosted feature service repository. And yes, currently this is possible through ArcGIS organizational account access only.

Perhaps this will be improved in the future. Either way, after working with both the QGIS and ArcGIS tools, this whole GIS technology is no longer rocket science to me. It's only a matter of time to get more experienced with it! :-)