OpenAI Chief Scientist Dr Ilya Sutskever



Speaker: Dr Ilya Sutskever interviewed by NVIDIA’s Jensen Huang
Transcribed by: OpenAI Whisper via SteveDigital’s HF Space.
Edited by: Alan (without AI!)
Date: 15/Mar/2023 (day after GPT-4 release)


– When we train a large neural network to accurately predict the next word in lots of different texts from the Internet, what we are doing is that we are learning a world model. It may look—on the surface—that we are just learning statistical correlations in text. But it turns out that to ‘just learn’ the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.

Full edited transcript

Jensen: Ilya, unbelievable. Today is the day after GPT-4. It’s great to have you here. I’m delighted to have you. I’ve known you a long time. The journey, and just my mental memory of the time that I’ve known you, and the seminal work that you have done. Starting at the University of Toronto, the co-invention of AlexNet with Alex Krizhevsky and Geoff Hinton, which led to the big bang of modern artificial intelligence. Your career that took you out here to the Bay Area, the founding of OpenAI, GPT-1, 2, 3, and then of course ChatGPT, the AI heard around the world. This is the incredible résumé of a young computer scientist, with an entire community and industry in awe of your achievements. I just want to go back to the beginning and ask you about deep learning. What was your intuition around deep learning? Why did you know that it was going to work? Did you have any intuition that it was going to lead to this kind of success?

Ilya: OK, well, first of all, thank you so much for the quote, for all the kind words. A lot has changed thanks to the incredible power of deep learning. My personal starting point: I was interested in artificial intelligence for a whole variety of reasons, starting from an intuitive understanding and appreciation of its impact. And I also had a lot of curiosity about what consciousness is, what the human experience is. And it felt like progress in artificial intelligence would help with that.

The next step was, well, back then I was starting out in 2002, 2003. And it seemed like learning is the thing that humans can do, that people can do, that computers can’t do at all. In 2002, 2003, computers could not learn anything. And it wasn’t even clear that it was possible in theory. And so I thought that making progress in learning, in artificial learning, in machine learning, would lead to the greatest progress in AI. And then I started to look around for what was out there, and nothing seemed too promising. But to my great luck, Geoff Hinton was a professor at my university, and I was able to find him. And he was working on neural networks, and it immediately made sense. Because neural networks had the property that we are learning, we are automatically programming parallel computers. Back then the parallel computers were small. But the promise was, if you could somehow figure out how learning in neural networks works, then you can program small parallel computers from data. And it was also similar enough to the brain, and the brain works. So you had these several factors going for it. Now, it wasn’t clear how to get it to work. But of all the things that existed, that seemed like it had by far the greatest long-term promise.

Jensen: the time that you first started, at the time that you first started working with deep learning and neural networks. What was the scale of the network? What was the scale of computing at that moment in time? What was it like?

Ilya: An interesting thing to note is that the importance of scale wasn’t realized back then. So people would just train neural networks with like 15 neurons, 100 neurons, several hundred neurons; that would be a big neural network. A million parameters would be considered very large. We would run our models on unoptimized CPU code, because we were a bunch of researchers. We didn’t know about BLAS. We used MATLAB; at least MATLAB was optimized. And we would just experiment: what is even the right question to ask? You would try to just find interesting phenomena, interesting observations. You can do this small thing, you can do that small thing. Geoff Hinton was really excited about training neural nets on small little digits, both for classification, and also he was very interested in generating them. So the beginnings of generative models were right there. But the question was: okay, you’ve got all this cool stuff floating around, what really gets traction? It wasn’t obvious that this was the right question back then. But in hindsight, that turned out to be the right question.

Jensen: Now, the year of AlexNet was 2012. You and Alex were working on AlexNet for some time before then. At what point was it clear to you that you wanted to build a computer-vision-oriented neural network, that ImageNet was the right data set to go for, and to somehow go for the computer vision contest?

Ilya: Yeah. So I can talk about the context there. I think probably two years before that, it became clear to me that supervised learning is what was going to get us the traction. And I can explain precisely why. It wasn’t just an intuition. It was, I would argue, an irrefutable argument, which went like this. If your neural network is deep and large, then it could be configured to solve a hard task. So that’s the keyword: deep and large. People weren’t looking at large neural networks. People were, you know, maybe studying a little bit of depth in neural networks. But most of the machine learning field wasn’t even looking at neural networks at all. They were looking at all kinds of Bayesian models and kernel methods, which are theoretically elegant methods, but which have the property that they actually can’t represent a good solution no matter how you configure them. Whereas the large and deep neural network can represent a good solution to the problem. To find the good solution, you need a big data set, and a lot of compute to actually do the work. We’d also done some work on optimization; it was clear that optimization was a bottleneck. And there was a breakthrough by another grad student in Geoff Hinton’s lab called James Martens. He came up with an optimization method, different from the ones we are using now; it was some second-order method. But the point about it is that it proved that we can train those neural networks. Because before, we didn’t even know we could train them. So if you can train them, you make it big, you find the data, and you will succeed. So then the next question is: well, what data? And the ImageNet data set, back then, seemed like this unbelievably difficult data set. But it was clear that if you were to train a large convolutional neural network on this data set, it must succeed, if you just have the compute.

Jensen: And right at that time, your history and mine, our paths, intersected. Somehow you had the observation that a GPU, and at that time we were a couple of generations into our CUDA GPU, I think it was the GTX 580 generation, you had the insight that the GPU could actually be useful for training your neural network models. What was that? How did that day start? You never told me that moment.

Ilya: Yeah. So the GPUs appeared in our lab, in our Toronto lab, thanks to Geoff. And he said, we should try these GPUs. And we started trying and experimenting with them. And it was a lot of fun. But it was unclear what to use them for exactly. Where are you going to get the real traction? But then, with the existence of the ImageNet data set, it was also very clear that the convolutional neural network is such a great fit for the GPU. So it should be possible to make it go unbelievably fast, and therefore train something which would be completely unprecedented in terms of its size. And that’s how it happened. And, you know, very fortunately Alex Krizhevsky really loved programming the GPU. And he was able to program really fast convolutional kernels, and then train the neural net on the ImageNet data set. And that led to the result.

Jensen: It shocked the world. It broke the record of computer vision by such a wide margin that it was a clear discontinuity.

Ilya: Yeah. And there is another bit of context there. It’s not so much that it broke the record; I think there’s a different way to phrase it. It’s that the data set was so obviously hard, and so obviously outside the reach of anything. People were making progress with some classical techniques, and they were actually doing something. But this thing was so much better, on a data set which was so obviously hard. It was not just some competition. It was a competition which, back in the day, was so obviously difficult, so obviously out of reach, and so obviously with the property that if you did a good job, that would be amazing.

Jensen: The big bang of AI. Fast forward to now. You came out to the Valley, you started OpenAI with some friends, and you were the Chief Scientist. What was the first initial idea about what to work on at OpenAI? Because you guys worked on several things, and some of the trails of inventions and work, you could see, led up to the ChatGPT moment. But what was the initial inspiration? How did you approach intelligence from that moment, that led to this?

Ilya: Yeah. So obviously when we started, it wasn’t 100% clear how to proceed. And the field was also very different compared to the way it is right now. Right now we are already used to these amazing artifacts, these amazing neural nets doing incredible things, and everyone is so excited. But back in 2015, early 2016, when we were starting out, the whole thing seemed pretty crazy. There were so many fewer researchers; maybe there were between 100 and 1,000 times fewer people in the field compared to now. Back then you had like 100 people, most of them working at Google/DeepMind, and that was that. And then there were people picking up the skills, but it was a very, very scarce, very rare skill. And we had two big initial ideas at the start of OpenAI that had a lot of staying power, and they have stayed with us to this day. And I’ll describe them right now.

The first big idea that we had, one which I was especially excited about very early on, is the idea of unsupervised learning through compression. Some context: today we take it for granted that unsupervised learning is this easy thing; you just pre-train on everything and it all does exactly as you’d expect. In 2016, unsupervised learning was an unsolved problem in machine learning that no one had any insight into, any clue as to what to do. Yann LeCun would go around giving talks, saying that you have this grand challenge of unsupervised learning. And I really believed that really good compression of the data will lead to unsupervised learning.

Now, compression is not language that’s commonly used to describe what has really been done, until recently, when it suddenly became apparent to many people that those GPTs actually compress the training data. You may recall the Ted Chiang New Yorker article, which also alluded to this. But there is a real mathematical sense in which training these autoregressive generative models compresses the data. And intuitively you can see why that should work: if you compress the data really well, you must extract all the hidden secrets which exist in it. Therefore, that is the key. So that was the first idea that we were really excited about, and that led to quite a few works at OpenAI, including the sentiment neuron, which I’ll mention very briefly. This work might not be well known outside of the machine learning field, but it was very influential, especially in our thinking.
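The mathematical sense in which next-token prediction is compression can be made concrete with a toy sketch. This is an illustration of the Shannon code-length identity, not anything from OpenAI’s training code; the text and the (deliberately weak) unigram character model are invented. A model that assigns probability p to the next symbol can encode it in about -log2(p) bits, so a better predictor means a shorter encoding:

```python
import math
from collections import Counter

text = "the cat sat on the mat. the cat ate the rat."

# Raw size at 8 bits per character.
raw_bits = 8 * len(text)

# A very weak 'model': unigram character frequencies estimated from the text.
counts = Counter(text)
total = len(text)

# Shannon code length: a model assigning probability p to the next symbol
# can encode it in -log2(p) bits (e.g. via arithmetic coding). Summing over
# the text gives the compressed size under this model.
model_bits = sum(-math.log2(counts[c] / total) for c in text)

print(f"raw: {raw_bits} bits, unigram model: {model_bits:.0f} bits")
```

A model that predicts the next character from more context would assign higher probabilities and compress further; in this view, minimizing next-token loss is literally minimizing the compressed size of the data.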

In this work, the result was that when you train a neural network, and back then it was not a Transformer, it was before the Transformer: a small recurrent neural network, an LSTM. Sequence work, some of the work that you’ve done yourself. So the same LSTM, with a few twists, trained to predict the next token in Amazon reviews, the next character. And we discovered that if you predict the next character well enough, there will be a neuron inside that LSTM that corresponds to its sentiment. So that was really cool, because it showed some traction for unsupervised learning, and it validated the idea that really good next-character prediction, next-something prediction, compression, has the property that it discovers the secrets in the data. That’s what we see with these GPT models, right? You train, and people say it’s just statistical correlation. I mean, at this point it should be so clear to anyone.

Jensen: And that observation also, for me intuitively, opened up the whole world of where to get the data for unsupervised learning. Because I do have a whole lot of data. If I could just make you predict the next character, and I know what the ground truth is, I know what the answer is, I could train a neural network model with that. So that observation, and masking, and other technology, other approaches, opened my mind about where the world would get all the data for unsupervised learning.

Ilya: Well, I would phrase it a little differently. I would say that with unsupervised learning, the hard part has been less around where you get the data from, though that part is there as well, especially now. It was more about: why should you do it in the first place? Why should you bother? The hard part was to realize that training these neural nets to predict the next token is a worthwhile goal at all.

Jensen: It would learn a representation that it would be able to understand.

Ilya: That’s right. That it will learn grammar, and yeah. But it just wasn’t obvious, so people weren’t doing it. But the sentiment neuron work, and I want to call out Alec Radford as a person who was responsible for many of the advances there, this was before GPT-1, it was the precursor to GPT-1, and it influenced our thinking a lot. Then the Transformer came out, and we immediately went: oh my god, this is the thing. And we trained GPT-1.

Jensen: Along the way, you always believed that scaling would improve the performance of these models: larger networks, deeper networks, more training data. There was a very important paper that OpenAI wrote about the scaling laws, the relationship between loss and the size of the model and the amount of data, the size of the data set.

When Transformers came out, it gave us the opportunity to train very, very large models in a very reasonable amount of time. But the intuition about the scaling laws, and the size of models and data, and your journey of GPT-1, 2, 3: which came first? Did you see the evidence of GPT-1 through 3 first, or was it an intuition about the scaling law first?

Ilya: The intuition. So the way I’d phrase it is that I had a very strong belief that bigger is better, and that one of the goals that we had at OpenAI was to figure out how to use the scale correctly. There was a lot of belief in OpenAI about scale from the very beginning. The question was what to use it for precisely. Because I’ll mention, right now we are talking about the GPTs, but there is another very important line of work which I haven’t mentioned, the second big idea, and I think now is a good time to make a detour: that’s reinforcement learning. That clearly seemed important as well. What do you do with it? So the first really big project that was done inside OpenAI was our effort at solving a real-time strategy game. And for context, a real-time strategy game is like a competitive sport. You need to be smart, you need to have a quick reaction time, there is teamwork, and you’re competing against another team. And it’s pretty involved, and there is a whole competitive league for that game.

The game is called Dota 2. So we trained a reinforcement learning agent to play against itself, with the goal of reaching a level where it could compete against the best players in the world. And that was a major undertaking as well. It was a very different line; it was reinforcement learning.

Jensen: Yeah, I remember the day that you guys announced that work. This is, by the way, why I was asking earlier: there’s a large body of work that has come out of OpenAI, and some of it seemed like detours. But in fact, as you’re explaining now, they might have seemed like detours, but they really led up to some of the important work that we’re now talking about: ChatGPT.

Ilya: Yeah, I mean, there has been real convergence, where the GPTs produce the foundation, and the reinforcement learning on Dota morphed into reinforcement learning from human feedback.

Jensen: That’s right.

Ilya: And that combination gave us ChatGPT.

Jensen: You know, there’s a misunderstanding that ChatGPT is in itself just one giant, large language model. There’s a system around it that’s fairly complicated. Could you explain briefly, for the audience, the fine-tuning of it, the reinforcement learning of it, the various surrounding systems that allow you to keep it on rails, give it knowledge, and so on and so forth?

Ilya: Yeah, I can. So the way to think about it is that when we train a large neural network to accurately predict the next word in lots of different texts from the internet, what we are doing is learning a world model. It may look, on the surface, like we are just learning statistical correlations in text. But it turns out that to just learn the statistical correlations in text, to compress them really well, what the neural network learns is some representation of the process that produced the text. This text is actually a projection of the world.

There is a world out there, and it has a projection onto this text. And so what the neural network is learning is more and more aspects of the world, of people, of the human condition, their hopes, dreams and motivations, their interactions, and the situations that we are in. The neural network learns a compressed, abstract, usable representation of that.

This is what’s being learned from accurately predicting the next word. And furthermore, the more accurate you are in predicting the next word, the higher the fidelity, the more resolution you get in this process. So that’s what the pre-training stage does. But what this does not do is specify the desired behavior that we wish our neural network to exhibit.

You see, what a language model really tries to do is answer the following question: if I had some random piece of text on the internet, which starts with some prefix, some prompt, what will it complete to? As if you just randomly ended up on some text from the internet. But this is different from what I want. I want an assistant which will be truthful, which will be helpful, which will follow certain rules and not violate them. That requires additional training. This is where the fine-tuning and the reinforcement learning from human teachers, and other forms of AI assistance, come in. It’s not just reinforcement learning from human teachers; it’s also reinforcement learning from human and AI collaboration. Our teachers are working together with an AI to teach our AI to behave.

But here we are not teaching it new knowledge. This is not what’s happening. We are teaching it. We are communicating with it. We are communicating to it. What it is that we want it to be. And this process, the second stage, is also extremely important. The better we do the second stage, the more useful, the more reliable this neural network will be. So the second stage is extremely important too. In addition to the first stage of the learn everything. Learn as much as you can about the world from the projection of the world, which is text.
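The idea behind this second stage can be sketched very crudely: a reward signal learned from human (and AI-assisted) preferences is used to favor the behaviors we want. The toy below uses best-of-n selection with a hand-written stand-in reward; the real pipeline learns a reward model from comparisons and updates the network with reinforcement learning (e.g. PPO), so everything here, including `toy_reward` and the candidate strings, is a hypothetical illustration:

```python
def toy_reward(response: str) -> float:
    """Pretend preference score: rewards answers that are substantive and
    honest about uncertainty. A real reward model is a trained neural net."""
    r = response.lower()
    score = 0.0
    if "i don't know" in r or "it depends" in r:
        score += 1.0          # honest hedging is preferred
    if len(response.split()) > 3:
        score += 0.5          # some substance beats a one-word reply
    return score

def best_of_n(candidates):
    """Pick the candidate completion the reward signal prefers most."""
    return max(candidates, key=toy_reward)

candidates = [
    "42.",
    "It depends on the context, but a common answer is 42.",
    "No.",
]
print(best_of_n(candidates))
```

The point of the sketch is only the shape of the loop: the base model proposes, a preference-derived reward scores, and the scored feedback steers what the system says, without adding new knowledge.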

Jensen: Now you could fine-tune it, you could instruct it to perform certain things. Can you instruct it to not perform certain things, so that you can give it guardrails to avoid these types of behavior? Give it some kind of a bounding box, so that it doesn’t wander out of that bounding box and perform things that are, you know, less safe or otherwise?

Ilya: Yeah. So this second stage of training is indeed where we communicate to the neural network anything we want, which includes the bounding box. And the better we do this training, the higher the fidelity with which we communicate this bounding box. With constant research and innovation we are able to improve this fidelity, and so it becomes more and more reliable and precise in the way in which it follows the intended instructions.

Jensen: ChatGPT came out just a few months ago, the fastest-growing application in the history of humanity. Lots of interpretations about why. But some things are clear: it is the easiest application that anyone has ever created for anyone to use. It performs tasks, it does things that are beyond people’s expectations. Anyone can use it. There are no instruction sets, there are no wrong ways to use it; you just use it. And if your instructions or prompts are ambiguous, the conversation refines the ambiguity until your intent is understood by the application, by the AI. The impact, of course, is clearly remarkable. Now, today is the day after GPT-4, just a few months later. The performance of GPT-4 in many areas is astounding: SAT scores, GRE scores, bar exams, the number of tests on which it is able to perform at very capable human levels is astounding. What were the major differences between ChatGPT and GPT-4 that led to its improvements in these areas?

Ilya: So, GPT-4 is a pretty substantial improvement on top of ChatGPT, across very many dimensions. We trained GPT-4, I would say, more than six months ago, maybe eight months ago; I don’t remember exactly. The first big difference between ChatGPT and GPT-4, and that’s perhaps the most important difference, is that the base on top of which GPT-4 is built predicts the next word with greater accuracy. This is really important, because the better a neural network can predict the next word in text, the more it understands it. This claim is now perhaps accepted by many at this point, but it might still not be completely intuitive as to why that is.

So I’d like to take a small detour and give an analogy that will hopefully clarify why more accurate prediction of the next word leads to more understanding, real understanding. Let’s consider an example. Say you read a detective novel. It has a complicated plot, a storyline, different characters, lots of events, mysteries, clues; it’s unclear. Then, let’s say that on the last page of the book, the detective has gathered all the clues, gathered all the people, and says: “Okay, I’m going to reveal the identity of whoever committed the crime, and that person’s name is…” Predict that word. Predict that word, exactly. My goodness, right? Yeah, right. Now, there are many different words. But by predicting those words better and better and better, the understanding of the text keeps on increasing. GPT-4 predicts the next word better.

Jensen: Ilya, people say that deep learning won’t lead to reasoning. But in order to predict that next word, to figure out, from all of the agents that were there, and all of their strengths or weaknesses or their intentions, and the context, who the murderer was, that requires some amount of reasoning, a fair amount of reasoning. How is it that it’s able to learn reasoning? And if it learned reasoning, one of the things I was going to ask you: of all the tests that were taken by ChatGPT and GPT-4, there were some tests that GPT-3 or ChatGPT was already very good at; there were some tests that GPT-3 or ChatGPT was not as good at, that GPT-4 was much better at; and there were some tests that neither is good at yet. Some of it has to do with reasoning, it seems. Maybe in calculus, it wasn’t able to break the problem down into its reasonable steps and solve it. But in some areas, it seems to demonstrate reasoning skills. So is it the case that in predicting the next word, it is learning reasoning? And what are the limitations now of GPT-4 that would enhance its ability to reason even further?

Ilya: You know, reasoning isn’t this super well-defined concept. But we can try to define it anyway: it’s maybe when you go further, where you’re able to somehow think about it a little bit and get a better answer because of your reasoning. And I’d say that our neural nets, maybe there is some kind of limitation there, which could be addressed by, for example, asking the neural network to think out loud. This has proven to be extremely effective for reasoning. But I think it also remains to be seen just how far the basic neural network will go. I think we have yet to fully tap out its potential. But yeah, I mean, there is definitely some sense in which reasoning is still not quite at the level of some of the other capabilities of the neural network, though we would like the reasoning capabilities of the neural network to be higher. I think that it’s fairly likely that business as usual will keep improving the reasoning capabilities of the neural network. I wouldn’t necessarily confidently rule out this possibility.

Jensen: Yeah, because one of the things that is really cool is that you can ask ChatGPT a question, but before it answers the question, say: tell me first what you know, and then answer the question. Usually when somebody answers a question, if you give me the foundational knowledge that you have, or the foundational assumptions that you’re making, before you answer the question, that really improves the believability of the answer. You’re also demonstrating some level of reasoning. So it seems to me that ChatGPT has this inherent capability built into it.

Ilya: To some degree. One way to think about what’s happening now is that these neural networks have a lot of these capabilities; they’re just not quite very reliable. In fact, you could say that reliability is currently the single biggest obstacle to these neural networks being useful, truly useful. It is sometimes still the case that these neural networks hallucinate a little bit, or maybe make some mistakes which are unexpected, which you wouldn’t expect a person to make. It is this kind of unreliability that makes them substantially less useful. But I think that perhaps with a little bit more research, with the current ideas that we have, and perhaps a few more of the ambitious research plans, we’ll be able to achieve higher reliability as well. And that will be truly useful. That will allow us to have very accurate guardrails, which are very precise. And it will make the model ask for clarification when it’s unsure, or say that it doesn’t know something when it doesn’t know, and do so extremely reliably. So I’d say that these are some of the bottlenecks, really. It’s not about whether it exhibits some particular capability, but more how reliably, exactly.

Jensen: Speaking of factualness and hallucination: I saw in one of the videos a demonstration that links to a Wikipedia page. Is there a retrieval capability? Has that been included in GPT-4? Is it able to retrieve information from a factual place that could augment its response to you?

Ilya: So the current GPT-4, as released, does not have a built-in retrieval capability. It is just a really, really good next-word predictor, which can also consume images, by the way; we haven’t spoken about that. It is then also fine-tuned with data and various reinforcement learning variants to behave in a particular way. It wouldn’t surprise me if some of the people who have access could perhaps request GPT-4 to make some queries and then populate the results inside the context, because the context length of GPT-4 is quite a bit longer now. So in short, although GPT-4 does not support built-in retrieval, it is completely correct that it will get better with retrieval.
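The retrieval pattern Ilya describes, issuing a query and populating the results inside the context before asking the model to answer, can be sketched minimally. The scoring below is naive keyword overlap, the documents are invented examples, and the prompt is simply printed rather than sent to any model; a real system would use a search engine or an embedding index:

```python
# Invented mini-corpus standing in for a search index.
documents = [
    "AlexNet won the ImageNet competition in 2012.",
    "The Transformer architecture was introduced in 2017.",
    "GPT-4 was released on 14 March 2023.",
]

def retrieve(query: str, docs, k: int = 1):
    """Rank documents by how many (lowercased) query words they share."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Populate retrieved results inside the context, then pose the question."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("When was GPT-4 released?"))
```

The model itself stays a next-word predictor; retrieval only changes what text sits in its context window, which is why a longer context makes this pattern more powerful.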

Jensen: Multimodality. GPT-4 has the ability to learn from text and images, and respond to input from text and images. First of all, the foundation of multimodal learning: of course, the Transformer has made it possible for us to learn from multimodal, tokenized text and images. But at the foundational level, help us understand how multimodality enhances the understanding of the world beyond text by itself. My understanding is that when you do multimodal learning, even when it is just a text prompt, the text understanding could actually be enhanced. Tell us about multimodality at the foundation: why it’s so important, what’s the major breakthrough, and the characteristic differences as a result?

Ilya: So there are two dimensions to multimodality, two reasons why it is interesting. The first reason is a little bit humble: multimodality is useful. It is useful for a neural network to see, vision in particular, because the world is very visual. Human beings are very visual animals; I believe that a third of the human cortex is dedicated to vision. And so, by not having vision, the usefulness of our neural networks, though still considerable, is not as big as it could be. So it is a very simple usefulness argument: it is simply useful to see, and GPT-4 can see quite well. The second reason for vision is that we learn more about the world by learning from images, in addition to learning from text. That is also a powerful argument, although it is not as clear-cut as it may seem. I’ll give you an example. Or rather, before giving an example, I’ll make a general comment. As human beings, we get to hear about a billion words in our entire life.

Jensen: Only one billion words?

Ilya: That’s amazing. That’s not a lot. So we need to complement it. We need to…

Jensen: Does that include my own words in my own head?

Ilya: Make it two billion, if you want. But you see what I mean. We can see that, because a billion seconds is 30 years. So you can kind of see: we don’t get to see more than a few words a second, and we are asleep half the time. So a couple of billion words is the total we get in our entire life. So it becomes really important for us to get as many sources of information as we can, and we absolutely learn a lot more from vision. The same argument holds true for our neural networks as well, except for the fact that the neural network can learn from so many words. So things which are hard to learn about the world from text, in a few billion words, may become easier from trillions of words.
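The arithmetic behind the "billion words" estimate checks out as a back-of-the-envelope calculation. The rates below (a few words per second, awake half the time, a 30-year window) are the assumptions from the conversation, not measured figures:

```python
# A billion seconds really is about 30 years.
SECONDS_PER_YEAR = 365.25 * 24 * 3600          # ~31.6 million seconds

years_per_billion_seconds = 1e9 / SECONDS_PER_YEAR
print(f"a billion seconds = {years_per_billion_seconds:.1f} years")

# Lifetime word budget under the stated assumptions.
words_per_second = 2          # assumed hearing/reading rate
awake_fraction = 0.5          # asleep half the time
lifetime_years = 30

lifetime_words = words_per_second * awake_fraction * lifetime_years * SECONDS_PER_YEAR
print(f"lifetime budget ~ {lifetime_words / 1e9:.1f} billion words")
```

That lands in the low billions, versus the trillions of words a large language model can train on, which is the asymmetry Ilya is pointing at.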

And I’ll give you an example. Consider colors. Surely one needs to see to understand colors. And yet, the text-only neural networks, which have never seen a single photon in their entire life, if you ask them which colors are more similar to each other, they will know that red is more similar to orange than to blue. They will know that blue is more similar to purple than to yellow. How does that happen? One answer is that information about the world, even the visual information, slowly leaks in through text. Slowly, not as quickly. But when you have a lot of text, you can still learn a lot.
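How color relationships can leak in through text alone can be illustrated with a tiny co-occurrence model. The corpus below is contrived so the effect is visible in five sentences; real models learn the same kind of structure, far more robustly, from trillions of words:

```python
import math
from collections import Counter

# Invented mini-corpus: warm colors share contexts, blue does not.
corpus = [
    "the red sunset glowed warm like fire",
    "the orange sunset glowed warm like fire",
    "the orange flames looked warm and red",
    "the blue ocean looked cold under the blue sky",
    "the cold blue ice reflected the sky",
]

def context_vector(word):
    """Count the words co-occurring with `word` in the same sentence."""
    vec = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if word in tokens:
            vec.update(t for t in tokens if t != word)
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

red, orange, blue = map(context_vector, ["red", "orange", "blue"])
print(cosine(red, orange) > cosine(red, blue))  # red's contexts resemble orange's
```

Even this crude count-based model places red nearer to orange than to blue, because the words around them overlap; that is the "leak" Ilya describes, just at toy scale.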

Of course, once you also add vision, and learning about the world from vision, you will learn additional things which are not captured in text. But I would not say that it is binary, that there are things which are impossible to learn from text only. I think this is more of an exchange rate. And in particular, if you are like a human being and you want to learn from a billion words, or a hundred million words, then of course the other sources of information become far more important.

Jensen: Yeah. And so, you learn from images. Is there a sensibility that would suggest that if we wanted to understand also the construction of the world, as in, you know, the arm is connected to my shoulder, and my elbow is connected, that somehow these things move, the animation of the world, the physics of the world. If I wanted to learn that as well, can I just watch videos and learn that?

Ilya: Yes.

Jensen: And if I wanted to augment all of that with sound. For example, if somebody said ‘great’, it could be sarcastic, or it could be enthusiastic, depending on how it is said. There are many, many words like that: ‘that’s sick’, or ‘I’m sick’, depending on how people say it. Would audio also make a contribution to the learning of the model? Can we put that to good use soon?

Ilya: Yes. Yeah, I think it’s definitely the case that, well, you know, what can we say about audio? It’s useful, it’s an additional source of information, probably not as much as images or video, but there is a case to be made for the usefulness of audio as well, both on the recognition side and on the production side.

Jensen: In the context of the scores that I saw, the thing that was really interesting was the data that you guys published: which of the tests were performed well by GPT-3.5, and which of the tests performed substantially better with GPT-4. How did multimodality contribute to those tests, do you think?

Ilya: Oh, I mean, in a pretty straightforward way. Anytime there was a test where, to understand the problem, you need to look at a diagram. For example, there is a math competition for high school students called AMC 12, and there, presumably, many of the problems have a diagram. GPT-3.5 does quite badly on that test.

GPT-4 with text only does, I don’t remember exactly, but maybe from a 2% to a 20% success rate. But then when you add vision, it jumps to a 40% success rate. So the vision is really doing a lot of work. The vision is extremely good. And I think being able to reason visually as well, and communicate visually, will also be very powerful and very nice, things which go beyond just learning about the world. You can learn about the world, you can then reason about the world visually, and you can communicate visually. Now, perhaps in some future version, if you ask your neural net, hey, explain this to me, rather than just producing four paragraphs it will produce a little diagram which clearly conveys to you exactly what you need to know.

Jensen: That’s incredible. One of the things that you said earlier was about an AI generating data to train another AI. There was a paper written about this, and I don’t completely know whether it’s factual or not, but there’s a total of somewhere between 4 trillion and something like 20 trillion useful language tokens that the world will be able to train on over some period of time, and we’re going to run out of tokens to train on. Well, first of all, I wonder if you feel the same way. And then, secondarily, whether the AI generating its own data could be used to train the AI itself, which you could argue is a little circular. But we train our brain with generated data all the time: by self-reflection, working through a problem in our brain; I guess neuroscientists suggest sleeping. We do a fair amount of developing our neurons. How do you see this area of synthetic data generation? Is that going to be an important part of the future of training AI, and the AI teaching itself?

Ilya: Well, I wouldn’t underestimate the data that exists out there. There is probably more data than people realize. And as to your second question: certainly, it’s a possibility; it remains to be seen.

Jensen: Yeah. Yeah, it really does seem that one of these days, our AIs, when we’re not using them, may be generating either adversarial content for themselves to learn from, or imagining solving problems that they can go off and then use to improve themselves. Tell us whatever you can about where we are now, and what you think the future will be. Not the too-distant future, but pick your horizon, a year or two. Where do you think this whole language model area will be, and what are some of the areas that you’re most excited about?

Ilya: Predictions are hard, and although it’s a little difficult to say things which are too specific, I think it’s safe to assume that progress will continue, and that we will keep on seeing systems which astound us in the things that they can do. The current frontiers will be centered around reliability, around the system being trusted: really getting to a point where you can trust what it produces, really getting to a point where, if it doesn’t understand something, it asks for a clarification, says that it doesn’t know something, says that it needs more information. I think those are perhaps the biggest areas where improvement will lead to the biggest impact on the usefulness of the systems, because right now that’s really what stands in the way. You have a neural network, you ask it to summarize some long document, and you get a summary. Are you sure that some important detail wasn’t omitted? It’s still a useful summary, but it’s a different story when you know that all the important points have been covered. If there is ambiguity, it’s fine, but if a point is clearly important, such that anyone else who saw it would say this is really important, then the neural network will also recognize that reliably. That’s when you know. Same for the guardrails, same for its ability to clearly follow the intent of the user, of its operator. So I think we’ll see a lot of that in the next two years.

Jensen: That’s terrific, because the progress in those two areas will make this technology trusted by people to use, and able to be applied to so many things. I was thinking that was going to be the last question, but I did have another one, sorry about that. From ChatGPT to GPT-4: when you first started using GPT-4, what are some of the skills that it demonstrated that surprised even you?

Ilya: Well, there were lots of really cool things that it demonstrated, which were quite cool and surprising. I’ll mention two. The short answer is that the level of its reliability was surprising. Where the previous neural networks, if you ask them a question, sometimes they might misunderstand something in a kind of a silly way, with GPT-4 that stopped happening. Its ability to solve math problems became far greater. It could really do a derivation, a long, complicated derivation; it could convert the units, and so on. That was really cool. Like many people…

Jensen: Work through a proof. Yes, pretty amazing.

Ilya: Not all proofs, naturally, but quite a few. Or another example: like many people noticed, it has the ability to produce poems with every word starting with the same letter. It follows instructions really, really well, not perfectly still, but much better than before. Yeah, really good. And on the vision side, I really love how it can explain jokes, it can explain memes. You show it a meme and ask it why it’s funny, and it will tell you, and it will be correct.

The vision part, I think, is very powerful too. It’s like really, actually seeing it, when you can ask follow-up questions about some complicated image with a complicated diagram and get an explanation. That’s really cool. But overall, I will say, to take a step back: I’ve been in this business for quite some time, actually almost exactly 20 years. And the thing which I find most surprising is that it actually works. It turned out to be the same little thing all along, which is no longer little; it’s a lot more serious and much more intense. But it’s the same neural network, just larger, trained on maybe larger datasets, in different ways, with the same fundamental training algorithm. So it’s like, wow, I would say this is what I find the most surprising. Whenever I take a step back, I go: how is it possible that those conceptual ideas, that the brain has neurons, so maybe artificial neurons are just as good, and so maybe we just need to train them somehow with some learning algorithm, how is it possible that those arguments turned out to be so incredibly correct? That would be the biggest surprise, I’d say.

Jensen: In the 10 years that we’ve known each other, the models that you’ve trained and the data you’ve trained on, from what you did on AlexNet to now, have grown about a million times. No one in the world of computer science would have believed that the amount of computation done in that 10 years’ time would be a million times larger, and that you would dedicate your career to go do that. You’ve done many more; your body of work is incredible, but there are two seminal works: the invention, the co-invention, of AlexNet and that early work, and now GPT at OpenAI. It is truly remarkable what you’ve accomplished. It’s great to catch up with you again, Ilya, my good friend. And it is quite an amazing moment. Today’s talk, the way you break down the problem and describe it: this is one of the best, beyond-PhD descriptions of the state of the art of large language models. I really appreciate that. It’s great to see you. Congratulations. Thank you so much. Yeah, thank you. It’s all so much fun. Thank you.


Dr Alan D. Thompson is an AI expert and consultant, advising Fortune 500s and governments on post-2020 large language models. His work on artificial intelligence has been featured at NYU, with Microsoft AI and Google AI teams, at the University of Oxford’s 2021 debate on AI Ethics, and in the Leta AI (GPT-3) experiments viewed more than 4.5 million times. A contributor to the fields of human intelligence and peak performance, he has held positions as chairman for Mensa International, consultant to GE and Warner Bros, and memberships with the IEEE and IET. Technical highlights.

This page last updated: 24/Jan/2024.