Delete activity in Azure Data Factory - Cleaning up your data files

(2019-Mar-20) File management may not be at the top of my list of priorities during data integration projects. I assume that once I learn enough about the sourcing data systems and the target destination platform, I'm ready to design and build a data integration solution between two or more connecting points. Then a historical file management process becomes a necessity, or a need arises to log and remove incorrectly loaded data files. Basically, a step in my data integration process to remove (or clean up) such files would be helpful.

Image by Skentophyte from Pixabay 

Last month, Microsoft introduced the new Delete activity in Azure Data Factory (ADF), which makes it possible to implement these and many other use-cases: https://docs.microsoft.com/en-us/azure/data-factory/delete-activity.

So, I wanted to explore more and test the following use-cases with this new ADF Delete activity:
a) Remove sourcing files after copying them to a staging zone,
b) Delete files based on their timestamp (historical files management),
c) Delete files from an on-premises system.

Delete file use-cases:
(A) Remove sourcing files after copying them to a staging zone
For this use-case, I will be using the ADF pipeline that I had already created for my previous blog post - Developing pipelines in Azure Data Factory using Template gallery.

After the copy activity, I will include a next step to remove my sourcing data files, since I will already have them in my staging blob container.


My expectation would be to see my staging container "storesales-staging" with the copied files

and my sourcing blob container "storesales" empty.

To make this happen, I only need to set up a blob storage dataset for my Delete activity with a folder name and an indication that I need to delete files recursively.
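For reference, here is a minimal sketch of what that Delete activity could look like in the pipeline JSON; the activity and dataset names below are placeholders of mine rather than the exact names from my pipeline:

{
    "name": "Delete source files",
    "description": "Placeholder example: remove sourcing files once the copy step has succeeded",
    "type": "Delete",
    "dependsOn": [
        {
            "activity": "Copy store sales files",
            "dependencyConditions": [ "Succeeded" ]
        }
    ],
    "typeProperties": {
        "dataset": {
            "referenceName": "StoreSalesSourceFiles",
            "type": "DatasetReference"
        },
        "recursive": true
    }
}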


Then, on that specific blob storage dataset, I just need to specify a file mask (*) so that all the files are removed:
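And as a rough sketch only (the dataset and linked service names are placeholders of mine), that blob storage dataset could be defined like this:

{
    "name": "StoreSalesSourceFiles",
    "properties": {
        "description": "Placeholder example: all files in the 'storesales' sourcing container",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "folderPath": "storesales",
            "fileName": "*"
        }
    }
}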


After running my ADF pipeline with the new Delete activity, all sourcing files get successfully copied to the staging container and are gone from the sourcing side as well.


(B) Historical files management
This is a very interesting case when you want to handle your file management process based on time-related attributes of your data files, i.e. when they were last modified or loaded.

Microsoft introduced time-based filters that you can apply to your Delete activity. They are currently managed at the dataset level and let you reference the file repository that needs to be cleaned. You can potentially say, I want to delete files that are 1 month old, 1 year old, or even only 1 hour or 1 minute old. And I find it very helpful that you can use the date functions that already exist in Azure Data Factory:
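As a quick sketch, and assuming the same kind of blob dataset as in use-case (A) with the time filter set directly on the dataset, such a condition could be expressed with the modifiedDatetimeEnd property and an ADF date expression. The folder, dataset and linked service names below are placeholders of mine; this example would target files last modified more than 30 days ago (for my 10-second test I would simply swap addDays for addSeconds):

{
    "name": "StoreSalesHistoricalFiles",
    "properties": {
        "description": "Placeholder example: files last modified more than 30 days ago",
        "linkedServiceName": {
            "referenceName": "AzureBlobStorageLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "AzureBlob",
        "typeProperties": {
            "folderPath": "storesales-staging",
            "fileName": "*",
            "modifiedDatetimeEnd": "@addDays(utcnow(), -30)"
        }
    }
}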


All it takes is adding this time-related condition to my Delete activity dataset, and voilà, my files that are only 10 seconds old after just being copied are deleted as well! You can adjust this filtering condition to your specific data scenario.



(C) Delete files from an on-premises system
This is one of my favorite test cases: how cool is it to remove files from my local computer's C:\ drive using a cloud-based workflow in Azure Data Factory :-).

To enable this scenario, you will need to install a self-hosted Integration Runtime in your on-premises environment. Then you need to create a linked service to your on-premises file system in your data factory, which will require:
- Host (the root folder path that your Delete activity will work against)
- User name and password to access your on-premises files (I would recommend saving the password in Azure Key Vault and then referencing that secret name in your linked service, as sketched below).
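Putting those together, a file system linked service with a Key Vault secret reference could look roughly like this; the linked service, integration runtime and secret names are placeholders of mine:

{
    "name": "OnPremFileSystemLinkedService",
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "C:\\Temp\\test_delete",
            "userId": "MYMACHINE\\myuser",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "AzureKeyVaultLinkedService",
                    "type": "LinkedServiceReference"
                },
                "secretName": "onprem-file-system-password"
            }
        },
        "connectVia": {
            "referenceName": "SelfHostedIntegrationRuntime",
            "type": "IntegrationRuntimeReference"
        }
    }
}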

The rest is super easy: you just need to reference a dataset built on your newly created file system linked service and specify additional criteria for the files to be deleted. In my case, I'm deleting all .txt files from the C:\Temp\test_delete folder on my local computer.
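Here is a minimal sketch of that file system dataset, assuming the placeholder linked service from the sketch above (the dataset name is also a placeholder of mine). The Delete activity itself is then configured exactly as in use-case (A), just pointing at this dataset instead of the blob one:

{
    "name": "OnPremTxtFiles",
    "properties": {
        "description": "Placeholder example: all .txt files under C:\\Temp\\test_delete",
        "linkedServiceName": {
            "referenceName": "OnPremFileSystemLinkedService",
            "type": "LinkedServiceReference"
        },
        "type": "FileShare",
        "typeProperties": {
            "fileName": "*.txt"
        }
    }
}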


Now I can have peace of mind knowing that a file-cleaning job can be done in Azure Data Factory as well! :-)


You can find all the code for these 3 data factory pipeline use-cases in my personal GitHub repository: https://github.com/NrgFly/Azure-DataFactory

Comments

  1. Long path tool can solve your problem easily. It can delete, copy, move and bulk rename long path files or locked files easily. Try it, it is a very useful, amazing tool.

     Reply: Thanks Liam, however I'm not sure if Long path tool would be a good fit for cloud-based data files.
  2. In my pipeline, under an IF condition, I need to generate an empty file at a specific location if the condition evaluates to False.

    Please suggest how I can create an empty file in ADLS Gen2 through a pipeline. Is there any activity available for that?

     Reply: ADF doesn't have a specific activity to generate an empty file. But you can do this by copying an empty file from a saved location to your specific location based on your condition. Or you can run a Copy Data activity where the source side returns 0 records, so that an empty file (with a header if you like) is created at your specified location.
  3. I will suggest Long path tool, as it is very popular software! When cleaning files from a Windows Operating System (OS), you can be halted with errors relating to the filename or the file path being too long.
