Azure Databricks Performance Notes

(2020-Feb-04) I didn't name this blog post "Performance Tips" since I'm just creating a list of helpful notes for myself on tuning the performance of my workloads with delta tables in Azure Databricks before I forget them. Another reason is that I'm still expanding my experience and knowledge of Databricks in Azure, and there are many other, more in-depth resources available on this very topic.

Image by Free-Photos from Pixabay

So here is my current list of high-level improvements that I can make to my workload in Azure Databricks:

1) Use a Storage Optimized Spark cluster type.
Even though one of the benefits of Apache Spark over Hadoop data processing is that Spark processes data in memory, we still need disks. Those disks are directly attached to a Spark cluster, and they provide space for shuffle stages and data spills from the executors/workers if these happen during a workload. So whether those disks are slow or fast will impact how well my data query is executed in Azure Databricks.

One of the recommendations for efficiently executing your data queries and reading Spark tables whose data is based on Parquet files in your data lake is to use Storage Optimized clusters. They prove to be faster than other Spark cluster types in this case.
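
Just for illustration, below is a minimal sketch of provisioning such a cluster through the Databricks Clusters API 2.0 from Python. This is my own example, not an official recipe: the workspace URL and token are placeholders, and the runtime version and the Standard_L8s_v2 node type (one of the Azure storage optimized L-series workers) are assumptions you would adjust for your environment.

import requests

workspace_url = "https://<your-workspace>.azuredatabricks.net"  # placeholder
token = "<personal-access-token>"                               # placeholder

# Assumed example values - the runtime version and node type may differ in your workspace
cluster_spec = {
    "cluster_name": "storage-optimized-etl",
    "spark_version": "6.4.x-scala2.11",
    "node_type_id": "Standard_L8s_v2",  # Azure L-series, storage optimized workers
    "num_workers": 4,
    "autotermination_minutes": 60,
}

response = requests.post(
    workspace_url + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + token},
    json=cluster_spec,
)
print(response.json())  # returns the new cluster_id on success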


2) Enable the Delta cache - spark.databricks.io.cache.enabled true
There is a very good resource available on configuring this Spark config setting: https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/delta-cache

And this will be very helpful for your Databricks notebook queries when you access the same dataset multiple times. Once you read this dataset for the first time, Spark places it into the local storage cache of the worker nodes, which speeds up further references to it for you.
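
As a quick illustration, here is how I could turn this cache on from a Python notebook cell for the current session; "sales_delta" below is just a made-up table name:

# Enable the Delta cache for the current Spark session
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# The first read of a table populates the cache on the workers' local disks;
# repeated reads of the same data are then served from that cache.
df = spark.table("sales_delta")  # hypothetical delta table
df.count()                       # first pass - reads from the data lake and fills the cache
df.count()                       # second pass - mostly served from the local cache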

3) Set an appropriate number of shuffling partitions 
By default, the spark.sql.shuffle.partitions setting is set to 200, which may not be sufficient for most big data scenarios. If you have configured a large Spark cluster for your big data workload but still keep the number of shuffling partitions at the default, this will result in slow performance, data spills and some of the workers' cores not being utilized at all (although you will still be charged for their provisioning).

So, in order to split the data being processed into smaller chunks, you will need to increase the number of shuffling partitions. The formula for this is pretty easy:

Number of Shuffling Partitions = Volume of Processing Stage Input Data / 128 MB

Some people also recommend keeping the size of each shuffle partition between 128 MB and 200 MB, but not more than that.
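
For example, here is a rough sketch of applying this formula in a Python notebook cell; the 50 GB of stage input below is just an assumed number:

# Assume the largest shuffle stage processes about 50 GB of input data
stage_input_bytes = 50 * 1024 * 1024 * 1024
target_partition_bytes = 128 * 1024 * 1024  # ~128 MB per shuffle partition

num_shuffle_partitions = stage_input_bytes // target_partition_bytes  # = 400

# Override the default of 200 for the current session
spark.conf.set("spark.sql.shuffle.partitions", num_shuffle_partitions)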

4) Use Auto Optimize for your write workload
The fewer small files (files smaller than 128 MB) you have in your data lake supporting your delta tables, the better your performance will be when you read data from these tables - https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize



With those settings enabled, either within your current Spark session:
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true

or permanently, as table properties on your delta tables:
ALTER TABLE [table_name] SET TBLPROPERTIES (delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true)

then the number of data files will be reduced while still supporting the aim of optimal performance.
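
For reference, here is a small sketch of how those settings look in a Python notebook cell; "sales_delta" is a made-up table name:

# Enable Auto Optimize for the current Spark session
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

# Or set the properties on the table itself
spark.sql("""
    ALTER TABLE sales_delta
    SET TBLPROPERTIES (
        delta.autoOptimize.optimizeWrite = true,
        delta.autoOptimize.autoCompact = true
    )
""")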

Also, I can execute the Optimize command manually for a particular table:
OPTIMIZE [table_name]
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/optimize
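
Or the same command from a Python notebook cell, again with a made-up table name:

# Manually compact the data files of a particular delta table
spark.sql("OPTIMIZE sales_delta")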

5) Clean up your files with the Vacuum command
If the data optimization of your existing tables was successful, then you may end up with the new, optimized set of data files along with a larger number of old files that supported your delta table in the data lake before the optimization. In this case, the Vacuum command will be your friend: it removes data files that are no longer used by your delta tables but still consume disk space in your data lake storage. Those unused files may also result from other data update/insert (upsert) operations.
VACUUM [table_name] [RETAIN num HOURS]
https://docs.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/vacuum
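
A small usage sketch; the table name is made up, and 168 hours simply matches the default 7-day retention period:

# Remove files that are no longer referenced by the delta table
# and are older than the retention threshold (168 hours = 7 days)
spark.sql("VACUUM sales_delta RETAIN 168 HOURS")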

Again, this blog post is only meant to be considered a collection of helpful notes and definitely not a complete set of steps that you can undertake to optimize your workload in Azure Databricks. There are other resources available.

For me, this just helps to have one more additional point of reference to know and remember how to optimize my workload in Azure Databricks :-)
