Reducing costs of Azure Databricks (FinOps in practice)

SoftwareOne blog editorial team

Big data is what powers companies. More data means more insights, and more insights enable better decision-making.

Having lots of information helps organisations work more efficiently, develop new solutions and serve their customers better. But there's also the matter of cost: large-scale solutions generate large-scale expenses. The good news is that any savings are large-scale too. So how do you get them? Skills and knowledge definitely help, but research and determination can be equally important. We'll explain using a real-life example from one of our recent projects.

The quest for reducing cloud spend

Earlier this year, we worked with a company providing airport services such as cargo handling and ground services. Their customers (airlines) can book assistance via a business web portal. This portal is one of our client's key services and generates a lot of data: 4 million public requests and 26.4 million messages daily, amounting to 3.42 TB of data per day. With this much data, observability was key. The organisation needed specialised tools to manage, process and generate insights from this multitude of signals and sources, which is why they decided to introduce a dedicated data platform.

First attempts to reduce cloud costs

Their first implementation of a dedicated service was functional but also pricey: processing data cost them £1 million per year, just for this one solution. It came as no surprise that the company looked to lower the cost. The revised approach relied on a custom platform built in-house, based on Event Hub and Databricks. Although this made the service slightly cheaper, the cost was still around £700,000 per year. That's not exactly peanuts, so it's no surprise that the project got a bit of pushback. The company wanted to get the cost below £60,000. Not an easy challenge, but there had to be a way to do it - which sounded like fun, so we jumped at the chance to tackle it :)

Engage FinOps mode!

We faced a classic FinOps problem: how do you ensure the client only uses the resources they need, without overpaying for them? There are never easy answers, so a difficult one had to do - and we were determined to find it. First, we set out to understand the usage: how much data does Databricks store, and what happens to it? While processing data, Databricks simultaneously streams it to write storage. As a result, write operations are the platform's highest cost - and, at the same time, its highest variable cost. One way to optimise this was to change the frequency with which Azure Databricks takes a data snapshot. In other words, if the service did its "data autosave" less often, it should reduce the load on the system and, consequently, the cost. Sure enough, processing data at a lower frequency reduced the expense, but only marginally. That was the first attempt done - and not even close to the target £60k. It was time to dig deeper.
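For the curious, here is roughly what that tweak looks like in practice. This is a minimal PySpark sketch, not the client's actual pipeline - the table names, checkpoint path and the 5-minute interval are illustrative - but it shows the knob being turned: the processing-time trigger of a Structured Streaming write.

```python
# Minimal sketch of lowering the "autosave" frequency of a streaming write.
# Table names, checkpoint path and the 5-minute interval are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

events = (
    spark.readStream
    .table("raw_events")  # hypothetical streaming source table
)

query = (
    events.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/raw_events")  # placeholder path
    # By default a new micro-batch starts as soon as the previous one finishes;
    # a fixed trigger interval batches writes together and reduces write operations.
    .trigger(processingTime="5 minutes")
    .toTable("curated_events")  # hypothetical sink table
)
```

The exact interval is a trade-off between data freshness and the volume of storage transactions, so it's worth agreeing it with the business before turning it down.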

Diving into Databricks

The next idea was to look under the hood and investigate the Azure Databricks workflows, to see if they could be optimised. This is where things got really interesting. To explain, let's first go over how Azure Databricks is set up; Microsoft's architecture diagram for the service shows the main building blocks. Looking at this structure, we tried to figure out where else we might be able to cut costs further. VMs wouldn't help much here, as the volume of data was more or less consistent. So instead, we turned to DBFS – the Databricks File System – which uses blob storage for write operations.

Reducing cloud cost by switching off defaults

Apache Spark, the structured streaming engine behind Azure Databricks, by default comes with an HDFS-backed state store implementation, which keeps streaming state in the executors' JVM memory. While it works perfectly well in most cases, it can lead to GC (garbage collection) pauses when that memory gets overloaded. The good news is that for scenarios where the HDFS-backed store doesn't quite fit, there is an alternative: in Azure Databricks, you can use RocksDB for stateful streaming. With large volumes of data (reaching millions of state records at a time) it handles state more efficiently, keeping it in native memory and on local disk instead of on the JVM heap. We took advantage of this and changed the state management to RocksDB. What do you know - the solution cost went down well below the target (it's at around five digits per year now). Result! What did we learn, besides saving the customer TONS of money? Never accept the defaults. The default for Azure Databricks (and some other Spark-based services) is the HDFS-backed state store, whereas RocksDB rocks for some workloads. Be a challenger!
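If you want to try the same switch, it comes down to a single Spark configuration set on the cluster or session. The snippet below is a hedged sketch based on the Databricks and Spark documentation rather than a copy of our client's setup; check the state store documentation for your Databricks Runtime version before relying on it.

```python
# Illustrative sketch: switch Structured Streaming state management from the
# default HDFS-backed (JVM heap) provider to RocksDB. Apply this before
# starting any stateful streaming query, or set it in the cluster's Spark config.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Databricks Runtime ships its own RocksDB provider class; open-source Spark 3.2+
# uses org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider instead.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)
```

Keep in mind that the state store provider is tied to a query's checkpoint, so changing it for an existing streaming query generally means starting from a fresh checkpoint location.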

Cost optimisation at architecture level

In this case, we managed to save the client money by optimising their workload. We reviewed the architecture carefully and chose a different data processing approach in Azure Databricks, one that was more efficient for this workload. It made no functional change to the solution's output, yet cut the client's platform cost by over 90% compared to the original service. If you'd like to know more, the technicalities are described in the documentation linked below. And if you'd rather talk to a human, get in touch with us.

Author

SoftwareOne blog editorial team

We analyse the latest IT trends and industry-relevant innovations to keep you up-to-date with the latest technology.