Jumpstarting Data-Centric AI
The Tools to Put Foundation Models to Work
As more enterprise organizations have recognized the utility of artificial intelligence technology, there’s been a major push to invest in and adopt new AI and ML infrastructure to drive insights and make predictions for businesses.
However, many of these solutions lack the mechanisms to unlock and operationalize the data needed to train and deploy models for high-quality AI projects. That pain point spawned the creation of Snorkel AI, which has developed an end-to-end data-centric machine learning platform for the enterprise.
Their flagship product, Snorkel Flow, has enabled large customers to quickly build and iterate from unlabeled datasets to high-quality machine learning models deployed in production. In recent years, the rise of large language (or foundation) models has significantly opened up opportunities for building AI applications, but most organizations don’t have the tools they need to actually put these models to use. To address that gap, Snorkel AI has evolved its product with the launch of their Data-Centric Foundation Model Development, which provides enterprise organizations with the capabilities to incorporate foundation models into their workflows.
“Big foundation models are great at generative and exploratory human-in-the-loop processes. They’re great at generating text and images, et cetera, but when you actually want to adapt them to predict or automate something at high accuracy with guarantees of performance, you need adaptations – most commonly some kind of fine-tuning or prompting,” says Snorkel AI CEO and co-founder Alex Ratner. “Doing this for complex, real production use cases usually requires labeled training data and the constant iteration and maintenance of that data, and that’s what we’re focused on.”
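To make the fine-tuning path concrete, here is a minimal sketch assuming the Hugging Face transformers and datasets libraries rather than Snorkel’s own tooling; the distilbert-base-uncased checkpoint is real, while labeled_examples.csv (with text and label columns) and the two-class setup are hypothetical placeholders for the labeled training data Ratner describes.

```python
# A minimal fine-tuning sketch, assuming the Hugging Face "transformers" and
# "datasets" libraries; "labeled_examples.csv" is a hypothetical file of
# (text, label) pairs standing in for real labeled training data.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Start from a pretrained checkpoint and add a two-class prediction head.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# The labeled data is the expensive, constantly maintained ingredient.
dataset = load_dataset("csv", data_files="labeled_examples.csv")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# Adapt (fine-tune) the pretrained model to the labeled task.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-model", num_train_epochs=3),
    train_dataset=dataset,
)
trainer.train()
```

The modeling code itself is short; in practice, most of the ongoing effort goes into building and maintaining the labeled dataset that the training loop consumes.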
Putting the capabilities to build impactful AI in the hands of more people has been Snorkel’s goal since its inception. The company spun out of Stanford’s AI Lab in 2019 and has been partnered with Greylock since 2020. Alex joined me on the Greymatter podcast to discuss the company’s journey, the evolving world of AI, and his vision for the future.
You can listen to our conversation at the link below or wherever you get your podcasts.
EPISODE TRANSCRIPT
Saam Motamedi:
In the time since Snorkel AI launched, AI has advanced significantly, and the company has likewise evolved to meet the expanded needs of enterprises trying to get up to speed on the latest machine learning approaches.
The latest addition to their Snorkel Flow platform, released in November, enables enterprises to put foundation models to use. Today we’re going to talk about what that looks like in practice, and I’m pleased to welcome Snorkel CEO and co-founder Alex Ratner. Alex, thanks so much for joining me on Greymatter.
Alex Ratner:
Thanks so much for having me, Saam.
SM:
Alex, as you and I often talk about, AI is a very dynamic and fast-moving field. Even since you launched Snorkel AI, we’ve seen a lot of change. Let’s start by just putting everything in context. Where are we today in AI and ML, and how do you characterize Snorkel AI’s role in its adoption?
AR:
It’s indeed a fast-moving space and it’s exciting every day. A lot of where we started – and we’ll get back into this – is from this shift that you talked about, from what we’ve called model-centric to data-centric AI development. I’ll start there, since it’s still obviously where we think a lot of the core focus deserves to be. At a high level, this is the idea of where the pain points (or the blockers to AI development and deployment) are – or, you could more optimistically say, the areas where an AI developer can productively iterate. It used to be all about the models: picking out features, building custom architectures, building bespoke infrastructure; all of that is what we call model-centric development. And training data (the data that models learn from) used to be seen as a second-class citizen. I sometimes call this the Kaggle era of machine learning, where a machine learning developer’s journey started by downloading a dataset that was nicely labeled and curated and then trying to train their model on that.
Fast forward to today, and a lot of the machine learning technologies and models have just leapt forward. We’ll talk in a second about even the recent progress around foundation models over the last couple of months. But really, over the last couple of years they’ve become more powerful, more push-button, more automated, more standardized, more commoditized – to the point where a state-of-the-art model is a couple lines of Python code and an internet connection to get it going (if you have the data).
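As a rough illustration of the “couple lines of Python code and an internet connection” point, here is a minimal sketch assuming the Hugging Face transformers library; the sentiment-analysis task and the example sentence are illustrative choices, not anything specific to Snorkel Flow.

```python
# A minimal sketch: a pretrained, near-state-of-the-art model in a few lines.
# The pipeline call downloads a default sentiment model over the internet.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
result = classifier("Labeling our contract data used to take months.")
print(result)  # a list with a predicted label and a confidence score
```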
So over the last bunch of years – certainly the last seven or eight that we’ve been working on this data-centric AI movement, first out of Stanford and now at the company – the game has really shifted towards the data and how you label and curate it to teach machine learning models.
The reality in most enterprises that we work with (top-10 U.S. banks, government agencies, and Fortune 500 healthcare systems) is that if you want to actually build a machine learning model for something, the balance of effort or time might look like a day to get the model and maybe several months to label the data to teach that model.
That is the trend we started with: the field has been undergoing this shift from model-centric to data-centric development, and it’s still the key thing that we see enterprises struggling with and that we address.
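For a sense of what programmatic, data-centric development can look like in code, here is a minimal weak-supervision sketch using the open-source Snorkel library that the company grew out of (not Snorkel Flow itself); the spam-versus-ham task, the keyword heuristics, and the toy dataframe are all illustrative assumptions.

```python
# A minimal weak-supervision sketch with the open-source Snorkel library.
# The task (spam vs. ham), heuristics, and toy data are illustrative only.
import pandas as pd
from snorkel.labeling import PandasLFApplier, labeling_function
from snorkel.labeling.model import LabelModel

ABSTAIN, HAM, SPAM = -1, 0, 1

@labeling_function()
def lf_contains_link(x):
    # Heuristic: messages containing a URL are probably spam.
    return SPAM if "http" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_short_message(x):
    # Heuristic: very short messages are probably legitimate.
    return HAM if len(x.text.split()) < 5 else ABSTAIN

df_train = pd.DataFrame({"text": [
    "Win a free prize now http://spam.example",
    "See you at lunch",
    "Claim your reward http://also-spam.example today",
    "Running ten minutes late",
]})

# Apply every labeling function to every example to build a label matrix.
applier = PandasLFApplier(lfs=[lf_contains_link, lf_short_message])
L_train = applier.apply(df=df_train)

# Combine the noisy, overlapping votes into probabilistic training labels,
# which can then be used to train any downstream model.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=100, seed=0)
probabilistic_labels = label_model.predict_proba(L=L_train)
```

Instead of hand-labeling every example, the idea is to write and iterate on labeling functions like these, let the label model reconcile their noisy votes, and focus human review where the heuristics conflict or abstain.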
Now, on top of that, there have been some really exciting developments recently around what are often called large language models – or I’ll call them foundation models in this chat. That’s partly out of loyalty to my co-founder Chris – who I know you know very well, and who is one of the co-founders of the Stanford Center for Research on Foundation Models – and partly because I think it’s actually an appropriate name.
These foundation models are big self-supervised models. If you’re a machine learning nerd like me, or you have machine learning nerd friends anywhere in your Twitter network graph, you’ve probably seen really incredible demos of these models learning to generate text, answers to questions, or images that are quite amazing. The question we ask on the heels of all this amazing progress, and of these foundation models scaling up, is, “How does this actually connect to providing production value for our customers?” And the answer right now is that it doesn’t in most places that we see. All this ability to generate exciting text and images doesn’t really translate to enterprise automation – and we’ll get into this more – but we see data and data-centric development as the bridge to connect the two, and that’s the core of what we’re announcing today; we’ll get into it more in our discussion.
SM:
Awesome. I want to get into the role that Snorkel AI and the new product you’re announcing today are going to have in actually putting these foundation models to use.
But I want to step back for a moment and ask you to spend a couple minutes just motivating foundation models. I’m sure all of us have played around with them, whether it’s GPT-3 for language generation or models like DALL-E for image generation. I know the first time I used DALL-E it felt like magic. I was never good at drawing, so it was fun to actually be able to create interesting creative assets.
And so I think we’ve tasted the power of them, yet at the same time there’s a question of, “How do we go from these really cool demos to these things actually changing the ways we work and live and operate?” So maybe spend a couple minutes on that, Alex – what’s hype and what’s real?
AR:
First of all, I want to plus-one that excitement about these models. I’ve also spent some time playing around with the multimodal models that can generate amazing images, as well as the text-based ones. How they work is not fundamentally surprising if you’ve been watching the space. We’ve used what I guess now need to be called medium language models – the same fundamental architectures and types of models – in Snorkel Flow for years now. Things like BERT, DistilBERT, et cetera are some examples in text that many of us run into and that we use and support in the platform.
But the degree to which they’ve scaled up – based on increasing amounts of data, compute, and engineering work – and the results are really amazing and exciting for the field. It’s a really exciting time, even on the academic side. With my academic hat on, I can say that seeing this shift from a very toy, theoretical view of machine learning in many places to studying the properties of these gigantic foundation models that we barely yet understand is remarkable. It’s an exciting time to be in machine learning.