Row Numbers in Azure Data Factory Data Flows

(2020-Oct-05Adding a row number to your dataset could a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number for your whole or partitioned data recordset. 

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Photo by Micah Boerma from Pexels

Instant reaction was to this new additional row number column using a derived column transformation - https://docs.microsoft.com/en-us/azure//data-factory/data-flow-derived-column. However, this was my mistake and ADF notified me that rowNumber() -  data flow function was only available in the Windows transformations - https://docs.microsoft.com/en-us/azure/data-factory/data-flow-window.


OK, I moved on and added a Window Transformation task, which does require at least one sorting column.

However, this was a bit of an issue in my data case. I didn’t need to sort and change the order of my recordset. Transformed dataset needed to be in the same order as its original sourcing data, as it would help to locate a transformed data record with its sourcing sibling record with the help of the new Row Number column.

So I thought that it wouldn't hurt to add just a static constant value column (let’s say with value  = 1 ) and then use this new (1) valued column in my rowNumber() window transformation as a sorting attribute. 

1) First I added a Derived Column transformation with the Column_Value_1 column value set to 1 (or any other constant value of your preference).
2) Then I added a Window transformation with the rowNumber() function and Column_Value_1 as a sorting column.
3) Then I was able to see my output result with preserved order of records and additional Row Number column.



Update:
Joseph Edwards has just pointed out that using the Surrogate Key transformation would achieve the very same result of adding a row number column, and this step doesn't even require sorting columns. I've just tested, and it's working, amazing! I have written this whole blog post just to realize that there is a better way to number rows in the Azure Data Factory flows :-)



Comments