Delta File Format: Bridge gaps in Data Lakes & Warehouses

Technical
Thought Leadership
Databricks

Thierry Barnay 7 min read. Dec 18

Contents

Why old school data lakes were more like data puddles

The Game-Changer

The Data Lake House: Where your data lives in luxury

About the Author

Why old school data lakes were more like data puddles

After working for many years on building Data Warehousing solutions on-prem, I wondered how a Data Lake could help businesses design and implement a Data Warehouse on the cloud.

Well, if you were building a Data Warehouse (DW) using the good old Kimball approach, you were in for quite a complex ride. Why? Simply because Data Lake file formats lacked two elementary capabilities from the database world:

1) UPDATE, MERGE and DELETE, as in ANSI SQL

2) ACID transactions (atomicity, consistency, isolation and durability)

Without those two basic capabilities, things like building Type 1 or Type 2 dimensions were out of the question. Why?

How could I update a field in a dimension? How could I retain a surrogate key? How would I rewrite all the Foreign Keys in my Fact tables?

Without the ability to UPDATE or MERGE, building Type 1 or Type 2 dimensions seemed like a mission fit for John Rambo. But here’s the twist: even Rambo would find this task challenging without the right tools. Sure, with brute force, you might manage to rebuild the dimensions and fact tables from scratch during a daily batch load, or opt for creating a wide-column table. But let’s face it: that’s a resource-intensive operation even Rambo would think twice about.
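To make the problem concrete, here is a minimal sketch in plain Python of what a Type 2 dimension load has to do. It is a toy model and not Delta or Spark code; the names `DimRow` and `apply_scd2` and the customer fields are hypothetical, chosen only for illustration. Notice that the load needs an in-place UPDATE (closing the current row) followed by an INSERT, which is exactly what an append-only file format cannot express without rewriting data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    surrogate_key: int          # warehouse-generated key referenced by fact tables
    customer_id: str            # business (natural) key from the source system
    city: str                   # the attribute whose history we track
    valid_from: date
    valid_to: Optional[date]    # None means "current version"


def apply_scd2(dim: List[DimRow], customer_id: str, city: str, load_date: date) -> None:
    """Apply one incoming source record to the dimension, Type 2 style."""
    current = next(
        (r for r in dim if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return  # attribute unchanged, nothing to do
    if current is not None:
        # The UPDATE: close the current version in place.
        current.valid_to = load_date
    # The INSERT: add the new version under a fresh surrogate key.
    next_key = max((r.surrogate_key for r in dim), default=0) + 1
    dim.append(DimRow(next_key, customer_id, city, load_date, None))


dim: List[DimRow] = []
apply_scd2(dim, "C1", "Auckland", date(2023, 1, 1))
apply_scd2(dim, "C1", "Wellington", date(2023, 6, 1))
```

After the second call the dimension holds two rows for customer `C1`: the closed Auckland version still owns surrogate key 1, so existing fact rows keep pointing at the right history, while the current Wellington version gets key 2.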

The Game-Changer

Then, about two years ago, while I was building data lakes as landing zones for cloud-based Enterprise Data Warehouses (EDWs), came the news!

[Image: Data Warehouse, Data Lake, and Lakehouse (Delta Lake)]

Could that be the tool to bridge the gap between traditional EDWs and Data Lakes?

Two years and perhaps thousands of implementations later, we can safely say that the gap has been bridged by the release of a new file format: Delta (delta.io).

Delta brings a host of powerful features to the table. First and most importantly, it supports ACID transactions, ensuring atomicity, consistency, isolation, and durability for your table data. This is a game-changer for data integrity. Secondly, Delta offers scalable metadata handling: table metadata is processed with Spark, just like the data itself, so even petabyte-scale tables remain manageable. This is crucial for managing large datasets efficiently.

But that’s not all. Delta also unifies both streaming and batch data use cases, offering a versatile solution for different data processing needs. It enforces schema rules and allows for schema evolution, providing both stability and flexibility in your data architecture.
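The difference between schema enforcement and schema evolution can be sketched with a toy model in plain Python. This is not the Delta API; `append_rows` and its `merge_schema` flag are hypothetical names used for illustration, and the type check is deliberately simplistic. The idea is that a batch that does not match the table schema is rejected as a whole, unless the writer explicitly opts in to adding new columns.

```python
from typing import Any, Dict, List

Schema = Dict[str, type]
Row = Dict[str, Any]


def append_rows(schema: Schema, table: List[Row], rows: List[Row],
                merge_schema: bool = False) -> None:
    """Append rows, rejecting the whole batch on a schema mismatch."""
    new_columns: Schema = {}
    for row in rows:
        for column, value in row.items():
            expected = schema.get(column) or new_columns.get(column)
            if expected is None:
                if not merge_schema:
                    # Enforcement: unknown columns are refused by default.
                    raise ValueError(f"unknown column: {column}")
                # Evolution: remember the new column and its type.
                new_columns[column] = type(value)
            elif not isinstance(value, expected):
                raise ValueError(f"{column}: expected {expected.__name__}")
    # The whole batch validated, so only now do we change schema and table.
    schema.update(new_columns)
    table.extend(rows)


schema: Schema = {"id": int, "city": str}
table: List[Row] = []
append_rows(schema, table, [{"id": 1, "city": "Auckland"}])

rejected = ""
try:
    append_rows(schema, table, [{"id": 2, "city": "Nelson", "segment": "SMB"}])
except ValueError as err:
    rejected = str(err)

# Opting in to evolution lets the same batch through and widens the schema.
append_rows(schema, table, [{"id": 2, "city": "Nelson", "segment": "SMB"}],
            merge_schema=True)
```

Validating the full batch before touching the table is the design choice that matters here: a bad batch leaves both the schema and the data exactly as they were.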

One of the most exciting features is time travel. Every operation that writes to a Delta table is automatically versioned, enabling not just rollback but also a historical audit trail for your data. This is invaluable for tracking changes and maintaining data quality over time.
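Versioning of this kind falls out naturally from an append-only commit log. The sketch below is a toy model in plain Python, loosely inspired by the idea of a transaction log that records which data files are added and removed by each commit; `ToyTableLog` and the file names are hypothetical, and the real log format has far more detail. Reading "as of" a version simply means replaying the log up to that commit.

```python
from typing import List, Set, Tuple

Action = Tuple[str, str]          # ("add" | "remove", data file name)


class ToyTableLog:
    def __init__(self) -> None:
        self.commits: List[List[Action]] = []

    def commit(self, actions: List[Action]) -> int:
        """Append one commit as a single unit; return the new table version."""
        self.commits.append(list(actions))
        return len(self.commits) - 1

    def snapshot(self, version: int) -> Set[str]:
        """Replay commits 0..version to find the files visible at that version."""
        if not 0 <= version < len(self.commits):
            raise ValueError(f"no such version: {version}")
        files: Set[str] = set()
        for actions in self.commits[: version + 1]:
            for kind, name in actions:
                if kind == "add":
                    files.add(name)
                else:
                    files.discard(name)
        return files


log = ToyTableLog()
v0 = log.commit([("add", "part-000.parquet")])
# An "update" rewrites the affected file: remove the old one, add the new one.
v1 = log.commit([("remove", "part-000.parquet"), ("add", "part-001.parquet")])

latest = log.snapshot(v1)      # what a normal read sees
as_of_v0 = log.snapshot(v0)    # "time travel" back to the first version
```

Because the old file is only marked as removed in the log rather than deleted, the earlier version stays readable. The same property gives atomicity in this toy model: readers see a commit in full or not at all.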

Delta also supports merge, update, and delete operations, which are essential for implementing Change Data Capture (CDC) and Slowly Changing Dimension (SCD) methodologies. These features make Delta incredibly versatile for a wide range of data management tasks.
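As a rough illustration of what a MERGE does when applying a CDC feed, here is a toy version in plain Python. It is a sketch of the semantics only, not the Delta API; `merge_changes`, the `op` field, and the sample rows are hypothetical. Each change record either updates a matching row, inserts a new one, or deletes an existing one (Type 1 behaviour, with no history kept).

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def merge_changes(target: Dict[int, Row], changes: List[Row]) -> None:
    """Apply a CDC batch to `target`, matching rows on the `id` column."""
    for change in changes:
        key = change["id"]
        if change["op"] == "delete":
            target.pop(key, None)                 # matched: delete the row
        else:
            data = {k: v for k, v in change.items() if k != "op"}
            if key in target:
                target[key].update(data)          # matched: update in place
            else:
                target[key] = data                # not matched: insert


customers: Dict[int, Row] = {
    1: {"id": 1, "city": "Auckland"},
    2: {"id": 2, "city": "Nelson"},
}
merge_changes(customers, [
    {"op": "upsert", "id": 1, "city": "Wellington"},
    {"op": "delete", "id": 2},
    {"op": "upsert", "id": 3, "city": "Dunedin"},
])
```

After the merge, customer 1 has moved to Wellington, customer 2 is gone, and customer 3 has been added, all from one pass over the change batch.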

You can find more information here

The Data Lake House: Where your data lives in luxury

With the introduction of the Delta file format, we’re not just talking about data lakes or data warehouses anymore; we’re talking about a new hybrid: the Data Lake House. Imagine the best of both worlds: the scalability and raw data storage capabilities of a data lake combined with the structured querying and reliability of a data warehouse. That’s your Lake House. It’s a revolutionary concept because it breaks down the barriers between structured and unstructured data, allowing businesses to have their cake and eat it too. You can run real-time analytics and machine learning models directly on raw data without the need for cumbersome ETL processes. In essence, the Lake House offers a unified, cost-effective, and incredibly flexible data architecture that’s poised to become the new gold standard in data management.

Here we are dealing with an open-source file format that unifies the big and the small, a bit like string theory in modern physics, which aims for a single model that explains both quantum mechanics and gravity.

The beauty of the Delta format is that you can start building your own Lakehouse on the Cloud Service Provider (CSP) of your choice, as it is now widely supported natively by Azure Data Factory, AWS Glue (3.0), and GCP Dataproc.

I’ve saved the best for last: instead of using native services from your CSP, you could sit above their Spark Serverless layer and work directly with the tools from the creators of Spark and Delta, Databricks.

Databricks is the de facto Lakehouse platform on the market today. They created Delta and released it as open source. How about that? Within the Databricks Lakehouse Platform, for once we have a unified environment where the whole Data and Analytics team can collaborate and create together. Why? Because the environment has been designed from the ground up to accommodate three personas: Data Engineers (building data pipelines), Data Analysts (finding insights in the data) and Data Scientists (solving complex questions with data).

Link to Article: Databricks launches delta lake

So now with one file format unifying lakes and warehouses, we also have one platform unifying the whole Data team as one.

So there you have it, folks. Delta isn’t just an airline or a river; it’s the future of data management. And if you’re looking to build your own Lakehouse, you know who to call…

Thierry Barnay
