(2020-Oct-05) Adding a row number to your dataset could a trivial task. Both ANSI and Spark SQL have the row_number() window function that can enrich your data with a unique number for your whole or partitioned data recordset.
Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.
Photo by Micah Boerma from Pexels
Instant reaction was to this new additional row number column using a derived column transformation - https://docs.microsoft.com/en-us/azure//data-factory/data-flow-derived-column. However, this was my mistake and ADF notified me that rowNumber() - data flow function was only available in the Windows transformations - https://docs.microsoft.com/en-us/azure/data-factory/data-flow-window.
OK, I moved on and added a Window Transformation task, which does require at least one sorting column.
However, this was a bit of an issue in my data case. I didn’t need to sort and change the order of my recordset. Transformed dataset needed to be in the same order as its original sourcing data, as it would help to locate a transformed data record with its sourcing sibling record with the help of the new Row Number column.So I thought that it wouldn't hurt to add just a static constant value column (let’s say with value = 1 ) and then use this new (1) valued column in my rowNumber() window transformation as a sorting attribute.
1) First I added a Derived Column transformation with the Column_Value_1 column value set to 1 (or any other constant value of your preference).
2) Then I added a Window transformation with the rowNumber() function and Column_Value_1 as a sorting column.
3) Then I was able to see my output result with preserved order of records and additional Row Number column.
Update:
Joseph Edwards has just pointed out that using the Surrogate Key transformation would achieve the very same result of adding a row number column, and this step doesn't even require sorting columns. I've just tested, and it's working, amazing! I have written this whole blog post just to realize that there is a better way to number rows in the Azure Data Factory flows :-)
Comments
Post a Comment