Words into Action

Recent breakthroughs in large language models have given rise to a cadre of truly awe-inspiring artificially intelligent tools. Following years of research and experimentation with large language models, the production and deployment of AI tools like GPT-3 signifies a new chapter for the field. Equally impressive is the rate at which these tools have been adopted and, in many cases, are serving as the foundation from which to build even more advanced AI products.

Now, technologists believe the ability for AI tools to move beyond interpreting (and responding with) increasingly sophisticated language is closer than ever. By incorporating many more forms of knowledge that enable the generation of original intelligence, AI is poised to transform the way we live and work.

“We’re going to end up in a world where a lot of humanity’s knowledge is going to get encoded in various different foundation models for many different things,” says Adept CEO and co-founder David Luan. “That’s going to be really powerful.”

Luan, whose ML research and product lab company is building general intelligence via a universal AI collaborator, recently joined me and Stanford computer science and statistics professor Percy Liang to discuss how advancements in large language models are paving the way for the next wave of AI.

This interview took place during Greylock’s Intelligent Future event, a day-long summit hosted by myself and fellow Greylock general partner Reid Hoffman. The summit featured experts and entrepreneurs from some of today’s leading artificial intelligence organizations. You can listen to this interview below or wherever you get your podcasts. You can also watch the video from this interview on our YouTube channel here.

Greymatter by Greylock · Adept Co-Founder David Luan & Stanford Prof. Percy Liang | The State of AI Language Models

Episode Transcript

Saam Motamedi:
Okay. David, Percy. I’m excited about this.

For those of you in the audience who aren’t familiar with these two gentlemen, Percy is the professor of computer science and statistics at Stanford, where, among other things, he’s the director for the Center for Research on Foundation Models, and David is one of the co-founders and CEO of Adept, an ML research and product lab building general intelligence by enabling humans and computers to work together. And before Adept, David was at Google leading a lot of large models efforts, and before that at OpenAI. And we’re fortunate to get to partner with David and the team at Adept here at Greylock.

Percy, David, thank you guys for being here and for doing this. So I want to start at a high level and just start with the state of play. There’s a lot of talk about large models, and it’s easy to forget that a lot of the recent breakthroughs and models that we’re all familiar with like DALL-E and GPT-3 are actually fairly recent. And so we’re still in the early innings of these models running in production and delivering real concrete customer and user value.

Maybe just give us the state of play, David, starting with you, where are we with large scale models and what’s the state of deployment of these models today?

David Luan:
Yeah, I think the stuff is just incredibly powerful, but I think we’re still underestimating how much there is left to run on this stuff. It’s still so incredibly early. Just take a look at a couple different axes. When we were training these models at Google, it became incredibly clear up front that you could basically take a lot of these hand-engineered machine learning models that people had been spending a lot of their time building, rip it out with this giant model, give it some fine tuning data, and turn it into a smaller model again and serve it, and that would just end up outperforming all of these things that people had done in the past. And so the fact that they’re able to improve existing things that companies are already using machine learning for, but also just how great it has been as a way to be able to create brand new AI products that couldn’t exist before.

It’s fascinating to me to watch things like GitHub Copilot and Jasper and stuff like that, that just hit a nerve so fast and go from zero to hero in terms of adoption. And I think we’re just in the very early innings of seeing a lot more of that. So I think that’s axis one.

I think axis two, too, is just that primarily what we’re talking about so far has been language models, but there’s so many other modalities, sources of human knowledge, all of this stuff. What happens when it’s not just predicting the next token of text, it becomes about predicting all of those other different things. And we’re going to end up in a world where a lot of humanity’s knowledge is going to get encoded in various different foundation models for many different things, and that’s going to be really powerful as well.

Percy Liang:
Yeah, I want to highlight that I agree with everything that David said. I want to emphasize one distinction he made, which is this: Already, with all the applications out there, these foundation models can just lift all boats and just make all the numbers go up.

I think another thing – which is even more, I think, exciting – is that there’s a huge sea of applications that we’re not maybe even dreaming of because we’re stuck in this paradigm where, what is ML? Well, you could gather some data, you train on it. But with prompting and all these other zero shot capabilities, I think you’re going to see a lot more new types, so I think we should be looking not just for how to make faster horses or faster cars, but new types of applications.

SM:
Percy, maybe to follow up on that, I totally agree, and I think it connects to David’s point around something like Copilot. And the thing that’s amazing to me about something like Copilot is both how new of an experience it is and how quickly it’s taken off and gone into end user adoption. What are some of the other areas that you’re looking forward to and are excited about in terms of net new applications that become possible because of these large models?

PL:
Yeah. So I mean maybe one general category you can think about is creation, so this includes code, text, proteins, videos, PowerPoint slides, anything that you can imagine humans doing right now, which could be a creative or more task oriented activity. You could imagine these systems helping you in the loop, taking you much farther and giving you many more ideas.

So I think the space is quite broad, and I think underscoring the multimodal aspect of this, which David had touched on, is really important. Right now we have language models and we have code models and we have image models, but think about things that you could do when you mix these together, creating different illustrated books or films or things like that.

I think one thing that you have to deal with is the long context dependence. I mean, relatively, right now you’re generating single images or texts up to maybe 2,000 or 8,000, depending on our model, tokens, but imagine generating full films. That’s going to require pushing the technology farther. But we have the data, and if we can harness that and scale up, then I think there’s a lot of possibilities out there.

SM:
David, what would you add? I mean, at Adept you guys spend a lot of time thinking about how to use these models to unlock new ways of collaborating with computers and software. I’m curious what some of the use cases you think about are.

DL:
So I think the thing that I’m most excited about right now is that I think all the creativity use cases Percy just highlighted are going to be extremely powerful. But I think what’s fascinating about these models is if you ask these generative models to go do something for you in the real world, they kind of just pretend like they’re doing something because they don’t have a first class sense of what actions are and what affordances are on your computer.

So the thing that I’m really excited about in particular is [this concept of] how do we bridge this gap? How do we train a foundation model of all of the actions that people take on a computer? And I think once you have that, you have this incredibly powerful base for being able to turn natural language into any sort of arbitrary complexity thing that you would then do on your machine.

SM:
So maybe if we take something like actuation as a key net new capability, or we take longer contexts as an important net new capability, I think the form of the question is where do we still need to see key research unlocks, and where are the key areas of focus to actually make these products a reality?

PL:
I mean I think there are maybe two sides of things. One is pushing up capabilities, and one is making sure things are pushed up in a way that’s robust and reliable and safe. So the first one is in terms of scaling. If you think about video and the ability to scale to hundreds of thousands of sequence lengths, I mean, I think you’re going to have to do something different. The transformer architecture has gotten us surprisingly far, but you need to do something different there.

And then, David mentioned this briefly, but I think these models are still in some ways chatbots. They give you the illusion that there’s something going on, and I think in certain applications this is actually okay if there’s another external validity check on things and with the human in the loop doing things.

But I think there’s a deep fundamental research question on how to make these models actually reliable, and there’s many strategies that people have tried using reinforcement learning or using more explanation based or retrieval augmented methods. But I feel like there’s still something deeper missing, and I think that this is one thing I hope the academic community and researchers will work on to ensure that these foundation models have good and stable foundations, as opposed to shaky ones.

DL:
Yeah, agree with a lot of what Percy just said. I think I would just add that I think the default path that we’re on is increasing scale and increasing data, and I think that will continue to lead to a lot of gains. The question becomes how do we pull forward the future faster? And I think that there’s a lot of different things that we should be thinking about.

I think one is specifically on the data side. Later on I’d be curious, at dinner and stuff, to understand from the audience how many people would agree that actually I think we’re much more constrained on data than we think.

I think within the next couple years everyone’s going to have – just to take on language as an example – plus or minus 20% quality, similar number of tokens, web crawl as anybody else. So then the question becomes where next? So I think that’s a really important question. I think we have another important question when it comes to what does true creativity mean? I feel like to me true creativity means being able to discover new knowledge, and I think the new knowledge discovery process, at least for foundation models as we’re training them today, as we actually get better at training these models, that actually just better models the training distribution. And so I think giving these models the ability to gather new information and be able to try out things, I think is also going to be really key. And finally, I think on the safety side, we have a lot, lot more to invest. A lot more questions there we have to go answer.

“The default path that we’re on is increasing scale and increasing data. Now, the question becomes how do we pull forward the future faster?”

SM:
So let’s get to safety in a moment.

Continuing on data, because I think that is a really important topic here, David, at Adept you all are thinking about how to build products that humans collaborate with, and I think one of the nice consequences of that becomes this data flywheel. Can you maybe add a little bit about how you’re thinking about that and how you’re approaching designing products that end users will work with?

DL:
Yeah, I think that it starts out with having a pretty crisp definition of what we want the end game to look like. And I think for us, we want to be building teammates and collaborators for people, like a series of increasingly powerful software tools that help humans increase the level of abstraction at which they can interact with their machine. To do a different analogy, it doesn’t replace the musician but it gives musicians synthesizers, that kind of analogy except for doing things on your computer.

I think because that’s where we want to go I think what’s really important to us is how do we solve these HCI problems where it really feels like you’re working together with the machine, at the same time using that as an opportunity for us to be able to learn from basically how humans break down really complicated problems, how humans actually get things done. That may be part of things that are much more complicated than trajectories you might just be able to see on the internet.

PL:
Just add something to that. I think the interaction piece is really interesting here because these models are in some ways the most interactive ML models we have. You have a playground, you type in a prompt and you immediately get to play with a model, as opposed to a previous cycle where someone gathers some data, trains a model, and then you experience it from the user. So the line between developer and user is actually interestingly getting kind of blurred, which I think is actually a good thing because if you can connect these two up, then you get a kind of better experience.

SM:
Is there anything interesting from both the HCI perspective and the foundation models perspective on the research side that you all are working on around interaction?

PL:
Yeah, so one thing that we’ve been doing at Stanford, as a part of a larger benchmarking effort, is trying to understand what it means for humans to interact with these models, because the classic way that people think about these models is you train these models and then there are a hundred benchmarks and you evaluate, and this is taking the automation approach. But as we know, a lot of the potential here is in Copilot or autocomplete kind of experiences where there is a human in the loop, and Adept I think is also a good example of this.

And what does that mean? Should we be building our models differently if we know that humans are going to be in the picture, as opposed to you’re doing full automation. And that’s an interesting thing because maybe in some cases you want a model not to just be accurate, but you want it to be more interpretable or more reliable or understandable, and for creative applications you may want a model to actually have a broader distribution of outputs. And we’re seeing some of this where what is good for actual interaction is not necessarily what’s good just for the standard benchmarks. So that’s really interesting.

SM:
How’s that going to get resolved? I’m thinking about a lot of classical machine learning applications, again, even there it’s still hazy, but there’s some point of view on benchmarks standards. There are different products out there that can actually measure these things around bias and auditing. As we massively blow up the scope around creativity, all of that kind of shifts, so how do you think this is going to resolve?

PL:
Yeah, so first order, scale definitely is helping. So we’re safe on that. If you scale up the models I think it lifts all boats. And given a particular scale, then you have a question of where you’re investing your resources. I think what we want to do is develop effective surrogate metrics, which you can actually evaluate which correlate well with human interaction. We don’t really have a good handle on this quite yet, but having humans in the loop for an inner loop is also potentially problematic and hard and not reproducible. So you want something that’s easy to evaluate, but at the same time that’s actually tracking what you care about.
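
One way to make the idea of a surrogate metric concrete is to check how well an automatic score tracks human ratings on the same outputs. The sketch below is illustrative only: the function name `pearson` and both lists of numbers are made up for this example and are not from the Stanford benchmarking work discussed here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical values: one automatic score and one human rating per model output.
automatic_scores = [0.2, 0.4, 0.5, 0.7, 0.9]
human_ratings = [1, 2, 3, 4, 5]

r = pearson(automatic_scores, human_ratings)
print(f"correlation between surrogate metric and human ratings: {r:.3f}")
```

A high correlation on a held-out set of human judgments would suggest the cheap metric can stand in for people in the inner loop. A low one would suggest it is not tracking what you care about.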

SM:
So I want to shift to building products and companies around large scale models. And David, maybe I’ll start with you. There are people in the audience who are in the early stages of building these companies and one fundamental question is, okay, do you go build on top of an OpenAI API? Do you build on something in the open source? Do you go build your own large model? How do you think a founder should navigate making that decision?

DL:
I think this is probably the biggest question for people to ask right now. I think the root thing that I think is worth answering first is what is the loop you’re going to run for your company to compound? Is it going to be oriented towards really deeply understanding a particular customer use case? Is it going to be oriented towards some sort of data flywheel that you’re trying to build?

I think the general thing here is that thinking about how that interfaces with the differentiation that you want to have as a business is going to be really key, because I think the world that I don’t think we want to live in is one where effectively these companies become sort of outsourced customer discovery engines and then new Amazon Basics versions of these things come out over time. That would not be a particularly good world to live in. So I think figuring out what that compounding looks like is the most important first step.

I think the other thing to think about here is just how many nines do you need? If you need a lot of nines of reliability, I think one thing that’s really, really difficult is you just lack all the affordances that you could possibly want if you are consuming this through an intermediary to get you to where you want to be with your customers. So I think that because of those different reasons you could end up choosing a very different point in space for how you want to ultimately consume these services.

PL:
Yeah, maybe just to add one thing: one nice thing about having these APIs is that it is extremely easy to get started and try something. You can sit down in an afternoon, you punch in some data, and you can get a sense of the possibilities. In some cases it’s sort of a lower bound on how well you can do, because you spend an afternoon, and if you invest more and if you fine tune and build several custom things, it can only get better, in some sense. So that I think has opened up a lot. The challenge is to even formulate what is the right problem to work on.

And typically you don’t know, and you have to collect data, and then you have to train a model and then that loop becomes very expensive, but you could just sit down in the afternoon, try a few things, and maybe few shot your way to something that’s actually reasonable. Now that kind of gets you into a different part of the space and you can iterate much faster.
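
To make "few shot your way to something" concrete, here is a minimal sketch of assembling a few-shot prompt as plain text. The helper name `build_few_shot_prompt`, the "Input:/Output:" layout, and the support-ticket examples are all assumptions made for illustration. The resulting string would then be sent to whichever model API you are trying out, which is deliberately left out here.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the new input."""
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The prompt ends with an open "Output:" slot for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical labeled examples; a handful is often enough for a first try.
examples = [
    ("The checkout page keeps crashing.", "bug"),
    ("Please add a dark mode.", "feature request"),
]

prompt = build_few_shot_prompt(
    "Classify each support message as 'bug' or 'feature request'.",
    examples,
    "The export button does nothing when clicked.",
)
print(prompt)
```

Swapping the examples or the instruction takes seconds, which is the fast iteration loop described above, as opposed to collecting data and training a model for each idea.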

SM:
Yeah, it makes a lot of sense in terms of prototyping quickly and trying to take out product market fit risk. One question becomes, and Percy, I’m curious for your take on this, if you start that way, how do you over time build durability into your product? Because I could make the argument, hey, maybe you’re just a thin layer on top of someone else’s API. You can quickly de-risk product market fit, but is there a real durability in your layer of the stack?

PL:
Right. Yeah, I mean, I think the transition out of an API is a very discrete one, in some sense. People also do Wizard of Oz experiments. You put a human there, and you have the human do it, and then you work out all the interface issues and whether this makes sense at all, and then you try to put something else, take the human out, and now you could put an API there and you could get a sense of what things are like. And then, in some cases, maybe few-shot learning is for some things actually not that strong if you have, for example, data, and maybe a fine-tuned T5 model or something much smaller can actually be effective. And I think the last thing on your mind should be, “Let’s go pre-train a 500 billion parameter model,” when you don’t know what application you’re building.

SM:
Maybe continuing on the theme of building on top of these models: despite the magical qualities of these things, there are still limitations. One of the limitations is falsehoods, and there are others that I think developers need to navigate as they think about building these applications. David, maybe starting with you, what do you think some of the key limitations are and how do you guide people around navigating those?

DL:
That’s a really good question. I think falsehoods are definitely a very interesting thing to go talk about. These models love to be massive hallucination engines, and so getting them to stick to the script can be quite difficult. I mean I think in the research community we’re all aware of a bunch of different techniques for improving that from things like learning from human feedback to potentially augmenting these models with retrieval and such.

I do think that, on the topic of falsehoods in particular, this idea of packing all of the world facts into the parameters of a model is pretty inefficient and somewhat wasteful, especially when some of those facts change over time, like who may be running a particular country in a particular moment. And so I think it’s pretty unlikely that’s going to be the terminal state for a lot of these things, so I’m really excited for a lot of the research that’ll happen to improve that.

But I think the other part actually goes back to a question of practicality and HCI, which is that you have a sense of every year we’re pushing fundamental advancements on these models, they get somewhat better at a wide variety of different tasks that already show receptivity to scale and receptivity to more training data examples. But how do you surf this wave where the particular capabilities you’re looking for from the model are good enough to be deployed where you can learn how to get from there to the finish line? And how do you work around some of these limitations in the actual interface to these models, such that it doesn’t become a problem for your users to use? I think that’s actually a really fascinating problem.

“How do you surf this wave where the particular capabilities you’re looking for from the model are good enough to be deployed where you can learn how to get from there to the finish line?”

PL:
Yeah, I mean these models I think are exciting. And the flip side is that they have a ton of weaknesses in terms of falsehoods, generating things that are not true, biases, stereotypes, and basically all the good and bad and ugly of the internet gets put into it.

And I think it’s actually much more nuanced than just, “Let’s remove all the incorrect facts and de-bias these models,” because there are efforts on filtering. You can filter out offensive language, but then you might end up marginalizing certain populations. And what is the truth? We like to think there’s a truth, but there’s actually a lot of text which is just opinions, and there’s different viewpoints and a lot of it’s not even falsifiable, so you can’t talk about truth value, or even falsifiability.

And there are some applications, for example, creative cases, where you do want maybe things which are a bit more edgy. How do you create fiction, for example, if everything has to be true? There’s no easy way to, even if you could, throw out all the kind of “bad stuff”.

So I think one thing that maybe is a good framework to think about is that there’s no one way to make these models absolutely much better. I think that what you want is control and documentation. You want to be able to understand, given a model, what it is capable of, what it should be used for, and what it should not be.

And this is tricky because these models have huge surface capabilities. You just get a prompt. You can put any string and get any other string back. So what am I supposed to do with it? What are the guarantees? And I think as a community we need to develop a better language for thinking about what are the possible inputs and outputs, what are the contracts, the things that you have in traditional APIs and good old fashioned software engineering? We need to import some of that so that downstream application developers can look at it like, “Oh yeah, okay, I’ll use this model, not that one, for my particular use case.”
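
As a rough illustration of what a "contract" for a model might look like in code, the sketch below records documented and prohibited uses plus an input limit, and checks a request against them. Everything in it is hypothetical: the class `ModelContract`, its fields, and the example values are invented for this sketch and do not describe any existing standard or model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelContract:
    """A hypothetical, minimal description of what a model is documented to do."""

    name: str
    max_input_chars: int
    intended_uses: frozenset
    prohibited_uses: frozenset

    def check(self, use_case, prompt):
        """Return a list of contract violations (an empty list means OK)."""
        problems = []
        if use_case in self.prohibited_uses:
            problems.append(f"use case '{use_case}' is prohibited")
        elif use_case not in self.intended_uses:
            problems.append(f"use case '{use_case}' is not documented")
        if len(prompt) > self.max_input_chars:
            problems.append("prompt exceeds max_input_chars")
        return problems


# Illustrative values only.
contract = ModelContract(
    name="example-model",
    max_input_chars=8000,
    intended_uses=frozenset({"summarization", "code completion"}),
    prohibited_uses=frozenset({"medical advice"}),
)

print(contract.check("summarization", "Summarize this memo."))
print(contract.check("medical advice", "What dose should I take?"))
```

A downstream developer could compare such records across models to pick "this model, not that one" for a given use case. The hard part, as noted above, is agreeing on what the fields should be.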

SM:
There’s another side to this, which for lack of a better word, I’ll use the word “risk”, which is if I’m a product builder and I’m building these models, building different products, there’s different levels of risk I might be willing to take in terms of what I’m willing to ship to end users and the level of guardrails I need in place before I’m willing to ship. How do you guys think about that and frameworks around that?

DL:
So I think there’s some interesting perspectives here specifically around just how you sequence out the different applications that you want to go after. I feel like one really nice property of these models is it’s just so easy to go from zero to a hand-wavy 80% quality on a wide variety of tasks. For some of those tasks, that’s all you need, and sometimes the iteration loop of having that 80% thing with humans is all you need to do to go run it over the finish line.

I feel like right now the biggest opportunity we all have is starting out by first addressing things like that. But I think that over time, actually one of the things that Kevin said that I really liked is that there will be more of a standardized toolkit for how to erase some of the lower hanging fruit risks related to generations that are inappropriate, or models going off the rails in various different ways. I think there’s another set of risks that are slightly longer term that I think are also really important to go think about, and I think those are definitely, definitely much harder.

PL:
Yeah. To build on top of that, I think there’s maybe one category of risk is also adversaries in the world. Whenever you have a product that’s getting enough traction, there’s probably people who want to mess with you. And one example is data poisoning, and this side I think hasn’t really been borne out, there’s some papers on it, but if you think about it, these models are trained on the entire whatever, web crawl, so anyone can go put up a webpage or put something on GitHub and that can enter the training data, and these things are actually pretty hard to detect. So if you think from a security point of view, this is a huge gaping hole in your system, and a determined attacker could probably figure out a way to screw over your system.

And the other thing to think about is misuse. These systems, all systems, powerful models like these are dual use technologies. There’s a lot of immense good that you can do with them, but they can also be used for fraud, disinformation, spam, and all the things that we already know exist but now amplified, and that’s also a scary thought.

“Whenever you have a product that’s getting enough traction, there are probably people who want to mess with you.”

DL:
Yeah, definitely a lot of asymmetric capabilities here, because just hooking up one of these giant code models to some RL agents to just get into systems. There’s so many things out there that become way easier for malicious actors to do as a result.

PL:
Yeah, attack is so much easier than defense.

SM:
I have a few more questions I want to get through, but just watching the time, I want to first open it up and see if there are questions in the audience.

Audience Member:
This is a question for Percy. From an academic standpoint, what is an ideal wishlist that you might have in ways that corporations and big companies that are building a lot of the ecosystem could help?

PL:
Yeah, I mean I think one big thing is openness and transparency, which is something I think is really sorely missing today. I think if you look at the deep learning revolution, what’s been great about it is that it’s benefited from having an open ecosystem with toolkits like TensorFlow, PyTorch, data sets that are online, and tutorials. And people can just download things, tinker and play with it, and it’s much more accessible. And now we have models which are behind APIs and charging certain fees. And you can’t really tinker as much with it.

And also, at the same time, a lot of organizations are addressing the same issues around, we talked about safety and misuse, but there’s not really an agreement on what the best practices are.

So I think what would be useful is to develop community norms around what is safe or what are best practices for mitigating some of these risks, and in order to do that there has to be some level of openness as well so that when a model is deployed you get a sense of what these models are capable of, and benchmark them and document them in a way that the community knows how to respond, as opposed to, “Here is a thing you can play with. You might shoot yourself in the foot, or you might not. Good luck.”

SM:
We have time for one more question. So maybe I’ll ask a final question, which goes back to creativity. I think one of the important things to inspire in all of us is what’s going to be possible, the magic of these models. And I think DALL-E was an important moment for people to see the type of creative generation that’s possible. I guess for each of you, what are some of the things that you think are going to be possible in the way we interact with these models in a few years that you’re most excited about? Maybe starting with you, David?

DL:
I think language. As we were talking about earlier, language is just the tip of the iceberg. We’re already seeing amazing things just with language, but I think when we start seeing foundation models or even bundled together foundation models that are multimodal for every domain of human knowledge, every type of input to a system or to ourselves that we care about, I just think we’re going to end up with some truly, truly incredible outcomes with that.

I think if I were to choose a personal thing that I’m not working on that I think would be really cool, actually I think Percy and I talked about this once, is what happens when you start having foundation models for robots and when you can take all of these demonstrations, all of the different trajectories of robots interfacing with the real world, and put them all into one particular system, and then have them have the same type of generality that we’ve seen with language. I think that’d be incredible.

PL:
Yeah, I agree with that and I have definitely thoughts, but maybe I’ll end with sort of a different perspective, I mean, well, not a different perspective, but another example which is we get excited about generating an image. It’s just an image, and it’s also something that you may imagine artists being able to do. And if you think about it, humans aren’t changing that much. Computers are. And before a year, or two years ago, we weren’t able to do that. Now we can. And so if you kind of extrapolate. Now, an image, it’s so small, and you think about videos or now you think about 3D scenes or immersive experiences maybe with personas, you could imagine generating worlds in a sense, and that’s kind of scary but also sort of exciting. And I don’t know, there could probably be many possibilities there.

But I think the bigness of things that you can create… I mean if you think about these models as excellent creators of large objects and you think about what are big objects? Well, they’re environments in some sense. And what if we could do that? What would that look like, and what are some applications that could unlock? That would be interesting to think about.

SM:
Yeah, there’s a commonality there, which, again, connects back to multiple modalities and just continuing to push scope. It’s a really exciting glimpse of the future. Percy, David, thank you guys so much for doing this.

DL:
Thank you.

PL:
Thank you.

“If you think about it, humans aren’t changing that much. Computers are.”

WRITTEN BY

Saam Motamedi

Saam partners with enterprise software entrepreneurs at the seed and early stages who are focused on new opportunities in intelligent applications, cybersecurity, AI, and data infrastructure.
