Don't Trust the Defaults

(2024-Mar-08) Software installation typically involves several steps. After accepting the license agreement, you may have an express option to streamline the installation process by accepting all defaults, including the installation file path and file extension. Alternatively, you can explore custom options if you are curious and wish to tailor the installation according to your preferences.

Existing IT cloud companies offer modern default options for various cloud resources. Smart functions monitor the hosting environment gracefully and automatically release resources if not used. Database components provide automated self-tuning and performance improvement options, among others. The magic word here is "automatically"; everything is done by default as designed.

Image by Tosieq from Pixabay

It was a typical Monday morning in October, much like any other workday. As you arrive at the office, your usual routine involves checking the weekend and overnight data load jobs for your company's data warehouse environment. In most instances, these checks are routine and show a reassuring "green" status with no identified red alerts, signifying stability. However, on this particular Monday, things took an unexpected turn.

Unlike the usual smooth operation, there were numerous failures during the midnight ETL process. The new daily jobs were unusually slow, and attempts to refresh reporting datasets in Power BI were taking significantly longer than usual.

Further investigation of the internal Azure SQL Server health reports revealed two alerts that occurred when our database environment was briefly unavailable. The explanations for these alerts were provided by Microsoft Azure:

  1. Your database was unavailable due to planned maintenance. To ensure service quality and security, we roll out upgrades on a monthly schedule. Upgrade payload may include OS patches and security fixes, new and updated SQL product features, and repairs. Planned maintenance events are almost always short, lasting at most a few seconds. Currently, Azure shows the downtime for your SQL database resource at a two-minute granularity. The actual downtime is likely less than a minute – average is 2s. Please also note the outage window may be shifted by around 5 minutes.
  2. Your database was moved to a different machine to ensure it has the resources required for its compute size. This is an occasional transient operation. Currently, Azure shows the downtime for your SQL database resource at a two-minute granularity. The actual downtime may be less than that. Please also note the outage window may be shifted by around 5 minutes.

We created a support request for the Microsoft Azure team to investigate and provide additional information regarding the sudden slowdown in performance. This was initiated based on the awareness that no changes had been made and deployed to our Production environment during the last several days.

The Microsoft support team proved to be very helpful in addressing our issue. They diligently investigated and identified the root cause of the slow queries, which were primarily related to data retrieval for supporting our reports. Furthermore, they provided valuable recommendations on how to enhance table indexes and refactor the data query code for improved performance.          

The main question was why our team had to implement those changes to improve our performance when no code changes were made on our side, and the data volume hadn’t dramatically increased in the last several days. The only thought was that Microsoft organized and planned maintenance of our database production environment, which resulted in a short downtime and further notice stating, “The database was moved to a different machine to ensure it has the resources required for its compute size.”. Just why!

To make our extended communication story with Azure support team short, we were unsuccessful in delving deeper into why our cloud database production environment was moved to a different machine and where that "machine" was physically located (out of curiosity).

However, they made a series of good suggestions:

  1. Optimize index maintenance to improve query performance and reduce resource consumption (https://learn.microsoft.com/en-us/sql/relational-databases/indexes/reorganize-and-rebuild-indexes?view=sql-server-ver16)
  2. Automate Azure SQL database tables' indexes and statistics with this custom SQL script (https://raw.githubusercontent.com/yochananrachamim/AzureSQL/master/AzureSQLMaintenance.txt), which was initially developed in 2021 and underwent further updates in later years.

This would all make total sense if our SQL Server database were hosted on-premises. Back in the non-cloud times, this work would have been done through scheduled maintenance plans (https://learn.microsoft.com/en-us/sql/relational-databases/maintenance-plans/maintenance-plans?view=sql-server-ver16). However, there is no such thing as a maintenance plan in Azure SQL Server databases. In fact, if you read the technical article from the first link, the word "automatically" is mentioned eight times, explaining that most of the table indexes and statistics are maintained automatically. And all of this is done by default! 

Eventually, we followed the recommendations of the Microsoft support team and scheduled our own database maintenance solution by running their custom SQL script daily to update table statistics and weekly table index updates during off-peak hours. We even went further by stabilizing, dropping, and recreating all the columnstore indexes, as tables with those indexes were significantly impacted by slow performance.

Also, we set up advanced notifications before maintenance events to receive alerts when planned maintenance is due to begin in the next 24 hours (https://learn.microsoft.com/en-us/azure/azure-sql/database/advance-notifications?view=azuresql#configure-an-advance-notification). Additional consideration might also be given to setting up a Maintenance Window for your SQL database to avoid peak business hours (https://learn.microsoft.com/en-us/azure/azure-sql/database/maintenance-window?view=azuresql).

But, ultimately, this crushed my dream of the automation processes promised by a cloud vendor. Don't trust the defaults; be curious about how things work inside. When something goes wrong, you will have a better chance of resolving it because you might have already looked inside and understood the custom and non-default product features better.

Comments