Faster, Easier, and Cheaper Data Lakes

The best businesses know how to use data to their advantage, leveraging it as a driving force for innovation and decision making. Yet as companies expand and data volumes grow, they often struggle with their data architecture. Data warehouses – which are great for processing and transforming structured data for advanced querying and analytics – are costly and struggle to scale. And data lakes – which are great for large volumes of mixed or unstructured data – can be difficult to manage and sort through unless you’re a skilled data engineer or data scientist.

Vinoth Chandar faced this problem firsthand while working as a software engineer at Uber. He needed the performance of a warehouse and the scale of a data lake, in real time. So Vinoth created Apache Hudi to implement a new architecture, in which core warehouse and database functionality is added directly to the data lake. He was a pioneer of the technology known today as the “lakehouse.”

The lakehouse architecture is a game changer for enterprises for several reasons. It takes less administration time and effort than maintaining both a data warehouse and a data lake. It serves as a single source for workloads across data science, machine learning, and SQL analytics, which means less unnecessary data movement and redundancy. A lakehouse also gives you direct access to data, reducing staleness and latency. And finally, it’s a far more cost-effective way to store and process data.

Given this, it’s no surprise that Apache Hudi, which was open sourced in 2017, has seen incredible success with startups and large enterprises alike. Thousands of organizations across the world – including Amazon, Disney+ Hotstar, Robinhood, and TikTok – have contributed to the Apache Hudi community and project. The open source project has grown to nearly 1 million monthly downloads, and at Uber, Hudi continues to ingest more than 500 billion records every day.

I am excited to share that we have co-led the seed investment in Onehouse with our friends at Addition. Onehouse leverages the unique capabilities of Apache Hudi to offer a cloud-native managed lakehouse service. Onehouse makes data lakes easier, faster, and cheaper. Instead of creating yet another vertically integrated data and query stack, it provides one interoperable and truly open data layer that accelerates workloads across all popular data lake query engines, such as Apache Spark, Trino, and Presto, and even in cloud warehouses through external tables.

I first met Vinoth when Onehouse was just an idea, and helped him iterate on what it could look like to build on the success of Apache Hudi. I was immediately impressed with Vinoth’s thinking about delivering the new data architecture on top of that open source success. And the more I got to know him, the more I was struck by his technical leadership, his understanding of customer data problems, and how his experience at Uber could translate to the entire market.

As I mention in my analysis on Open Source vs. Cloud Castles, we are increasingly seeing open source startups built in (and for) a cloud-native ecosystem. As a result, these new startups are making a market impact far more quickly than those from earlier generations, because they combine the low-friction distribution of a cloud service with the openness of open source communities to reach developers. Onehouse is the latest startup doing just that, and I’m thrilled to be part of their journey.