Populating PostgreSQL JSONB column using Azure Data Factory Data Flow

(2022-May-31) 

Personal notes from a recent ADF JSON pipeline development, to remember how to do it next time.

Requirements:

  1. Sourcing data comes from a SQL Server database
  2. The destination is a PostgreSQL database table
  3. The transformation logic aggregates several rows from a sourcing table and populates the resulting JSON document into a JSONB-type column of a single row

Things to remember:

  1. It helps to prepare the resulting JSON document in the sourcing query as a string expression, with all the required JSON elements already in place, to save validation time in the ADF pipeline
  2. The T-SQL STRING_AGG function is a nice alternative for combining multiple table rows into a single text value; previously I would rely on a SELECT … FOR JSON PATH T-SQL statement for this (see the first sketch after this list).
  3. A good way to validate the sourcing data is to extract it into blob storage and then pass the extracted file further along the ETL pipeline. Don’t forget to set ‘No…’ for the Escape character and Quote character settings in the blob storage sourcing dataset.
  4. Special characters to pay attention to:
    1. The ADF Escape data flow expression allows you to format an input string to a particular format; however, I chose to control the construction of my JSON document in the sourcing query myself, without using this function. This also saved me some time troubleshooting several data transformation issues.
    2. CR = Carriage Return (0x0D in hexadecimal, 13 in decimal) and LF = Line Feed (0x0A in hexadecimal, 10 in decimal) could be replaced with \r and \n; however, I decided to remove them in my sourcing query instead (see the cleanup sketch after this list).
    3. TABs can also be replaced with \t, but I simply replaced them with a space ‘ ‘ in the sourcing query as well.
  5. The Alter row transformation in the mapping data flow is a great way to support multiple write scenarios (Insert, Update, Delete, Upsert) in a single data flow stream (a script sketch follows below).
  6. A JSONB data column in PostgreSQL provides its own data validation during the ETL process. When my ADF pipeline tried to push data to that column and I was getting the error message "StatusCode": "DFExecutorUserError", it meant that my final JSON document was not well-formed and didn’t meet the JSON type requirements (a quick validation snippet follows below).
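
As a minimal sketch of points 1 and 2, here is how a sourcing query could assemble the JSON document as a plain string with STRING_AGG. The table and column names (dbo.SourceOrders, OrderId, ItemName, Qty) are hypothetical placeholders, not the actual objects from this project:

```sql
-- Hypothetical sourcing query: collapse detail rows into one JSON string per order.
-- SELECT ... FOR JSON PATH would be the usual alternative for the same result.
-- NOTE: this sketch does not escape embedded double quotes in ItemName.
SELECT
    o.OrderId,
    '{"orderId":' + CAST(o.OrderId AS varchar(20))
        + ',"items":['
        + STRING_AGG(
              '{"item":"' + o.ItemName + '","qty":' + CAST(o.Qty AS varchar(20)) + '}',
              ',')
        + ']}' AS OrderJson
FROM dbo.SourceOrders AS o
GROUP BY o.OrderId;
```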
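
A sketch of the special-character cleanup from point 4, again against the hypothetical ItemName column: CR and LF are removed, and TABs become spaces, before the value is embedded in the JSON string:

```sql
SELECT
    REPLACE(
        REPLACE(
            REPLACE(o.ItemName,
                CHAR(13), ''),      -- CR (0x0D): removed
            CHAR(10), ''),          -- LF (0x0A): removed
        CHAR(9), ' ')               -- TAB (0x09): replaced with a space
        AS CleanItemName
FROM dbo.SourceOrders AS o;
```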
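
For point 5, this is roughly what the Alter row settings translate to in data flow script; the stream names and the rowAction column are hypothetical, and the conditions would be whatever your own change-tracking logic requires:

```
SourceOrdersStream
    alterRow(
        insertIf(rowAction == 'I'),
        updateIf(rowAction == 'U'),
        deleteIf(rowAction == 'D'),
        upsertIf(rowAction == 'X')
    ) ~> AlterRowPolicies
```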
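
And for point 6, PostgreSQL itself can tell you whether a sample document is well-formed: casting text to jsonb fails with “invalid input syntax for type json” on a malformed document, which is what bubbles up through ADF as the DFExecutorUserError above. The table definition below is a hypothetical example of the destination:

```sql
-- Quick validation: a malformed document makes this cast fail.
SELECT '{"orderId":1,"items":[{"item":"widget","qty":2}]}'::jsonb;

-- Hypothetical destination table with the JSONB column:
CREATE TABLE order_documents (
    order_id   integer PRIMARY KEY,
    order_json jsonb NOT NULL
);
```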
