Reinforcement Learning with Rich Sutton
#5


00:00:00 Tom Mitchell: Hello and welcome to Machine Learning: How Did We Get Here? My name is Tom Mitchell, and today's episode is an interview with Rich Sutton. Rich is a longtime researcher in a branch of machine learning known as reinforcement learning. He and his longtime collaborator Andy Barto have done a lot of work in this area over decades. In the late nineties, they wrote a book in this area, and in twenty twenty-four Andy and Rich jointly received the ACM Turing Award for their contributions to reinforcement learning. Now, before we get started with the conversation, let me explain very briefly what reinforcement learning is. In machine learning, we have a number of different framings of what the learning problem is, partly based on what kind of training information is available. The dominant paradigm, or framing, of the machine learning problem is called supervised learning. For example, if you're learning to play a game like chess, then in the supervised learning setting we assume that every time you're in a particular board position and try to make a move, there's a teacher available to supervise you and to tell you which move is the correct move, the best move in this particular position. In that kind of framing, we seek machine learning algorithms that can use that kind of data: here's a board position, here's the correct move. In reinforcement learning, we assume there is no teacher. Instead, you must simply play the game, and at the end of the game you find out whether you win or lose, a reward or a penalty. But that outcome can be quite distant in time from the moves that you made; in fact, you make a whole sequence of moves in a game before you find out whether you win or lose. Reinforcement learning is about that latter, that second kind of framing of the problem, where there is no teacher.
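The contrast Tom draws between the two framings can be sketched in a few lines of code. This is a purely illustrative toy; the positions, moves, and reward values are made up for this sketch and come from neither the interview nor any real chess program:

```python
# Supervised learning: each training example pairs an input (a board
# position) with the teacher's correct output (the best move).
supervised_data = [
    ("position_1", "e2e4"),  # (board position, teacher-labeled best move)
    ("position_2", "g1f3"),
]

# Reinforcement learning: no teacher. The agent only sees the moves it
# happened to play, plus a single reward signal at the end of the game.
episode = {
    "moves": ["e2e4", "d7d5", "e4d5"],  # sequence of moves actually played
    "reward": +1,  # +1 for a win, -1 for a loss, known only at the end
}

# The credit-assignment problem: which of the moves deserves credit for
# the final reward? Supervised learning never faces this question,
# because every example carries its own label.
for move in episode["moves"]:
    print(move, "-> eventual reward:", episode["reward"])
```

The point of the sketch is only the shape of the data: per-example labels in one case, a single delayed reward over a whole sequence in the other.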
Okay, today I have with me Rich Sutton, one of the pioneers of machine learning, who in particular contributed immensely to a part of machine learning called reinforcement learning. Rich, great to have you with us.

00:02:54 Rich Sutton: Thank you so much, Tom.

00:02:56 Tom Mitchell: Let me start off by just asking: can you define reinforcement learning in one sentence?

00:03:04 Rich Sutton: I have to be careful how I start because I only get one sentence. Um, it could be.

00:03:08 Tom Mitchell: A long sentence.

00:03:09 Rich Sutton: Reinforcement learning is learning from experience, by trial and error, to achieve a goal.

00:03:20 Tom Mitchell: Okay, so now I'll give you as many sentences as you want. Tell us a little about the history of reinforcement learning, how you came to decide to spend your time looking in this direction. What were the things going on at the time, in your head, that made you think this was an interesting direction to look? How did that all develop?

00:03:50 Rich Sutton: There's a lot of depth to this question, but let's just approach it incrementally. When machine learning was first being investigated, back in the fifties, it was natural to think about learning from experience and to do things that were related to reinforcement learning. Andy Barto gave a great lecture at the reinforcement learning conference last year on this. His point was that in the very beginning, machine learning was reinforcement learning, because they were trying to learn from trial and error and from rewards and penalties. And then, as it became formalized, it evolved into supervised learning, and the connection to reward was sort of lost. So throughout the sixties, the perceptron, pattern recognition, supervised learning, all those things became dominant. It's interesting to talk to you about this, Tom, because you've seen the other side: the sort of competition between the reinforcement learning approach and the supervised learning approach. In supervised learning you have instruction and examples, and in reinforcement learning the agent itself has to figure out what to do, and it may get rewards and penalties. So it has been an ebb and flow, and the field of machine learning has always been dominated by the more straightforward supervised approach. But as I mentioned, at the very beginning the rewards and penalties were very much a part of it.
But then, as things became better defined and clearer, the focus of the learning problem became pattern recognition and supervised learning. And this strange fellow, Harry Klopf, recognized this more than other people did, and wrote some reports and ultimately a book saying that something had been lost. Andy Barto and I picked up on his work and eventually realized that he was right, that something had been left out. In some sense it was obvious that something had been left out from the point of view of psychology, where I'd been studying how animals learn, and animals really learn in both ways, in both the supervised way and the reinforcement way. So we picked up on that and made it into a well-defined area. When was that? That would have been in the eighties. And then finally we wrote a book on it in ninety-eight, so it became a clear subfield of machine learning. The key thing is, the way I say it to myself: why is reinforcement learning powerful, potentially powerful? It's powerful because it's really learning from experience, learning from the normal data that an animal or a person would get, and it doesn't require specially prepared data, as you do require, of course, in supervised learning. The more you're able to just learn from what happens, the more powerful you can be, because the more generally applicable you can be. So this phrase, learning from experience, has been resonating with me more and more: learning from the ordinary, unprepared things that happen in the life of an agent. The original thing is Alan Turing.
Now that I've got the Turing Award, someone pointed out to me that Alan Turing said this in nineteen forty-seven, before there was a field of AI. It has even been claimed to be the first public presentation on AI: a nineteen forty-seven lecture to the London Mathematical Society. Turing talks about learning from rewards and penalties, and he has this line, quote: "What we really want is a machine that can learn from experience."

00:08:49 Tom Mitchell: Pretty good.

00:08:51 Rich Sutton: Yeah.

00:08:52 Tom Mitchell: Pretty good. So when I think of reinforcement learning, say in contrast to supervised learning, one of the key differences, of course, is that in supervised learning you have an input X and an output label Y, and somebody gives you some X-Y pairs and you figure out what the mapping is in general. But when you say learning from experience, then experience often involves a substantial sequence of things that you do, the cat trying to escape from the box and eventually getting to the food, to get the reward. And so that notion of a reward distant in time from the sequence of actions that produced it seems pretty fundamental in your work. I remember seeing the work you were doing on temporal difference learning in particular to address that, and I thought, oh, this really matters. At the time, you see different papers and you think, okay, that's interesting, that's interesting. But once in a while you see one and you say, oh, this really matters, because it could change the way we think about things. Can you just talk a little bit about...

00:10:16 Rich Sutton: Temporal difference learning? It's the best thing I ever did. It comes, again, from thinking about animals, because animals do this. Think about an animal. Even if an animal is doing something like pressing a bar to get food, the animal learns it in a couple of parts. First it learns that just before the food arrives, it hears the machinery of the food delivery system making little noises. It turns out that the effective time of reward is when those sounds first happen. So if there's any clue that something good is happening, then that clue, that cue I should say, becomes rewarding itself. It's called secondary reinforcement, and it's a very well developed idea in psychology. You just start thinking about that, and eventually it leads to temporal difference learning. Temporal difference learning just means that you're alert to the change over time of your prediction. Here we're predicting reward, and the effective reward is when you realize that the reward is coming. That's a temporal difference: you didn't think it was coming, and now you think it is coming, so there's an increment there. It means you should have thought it was coming even earlier. Temporal difference just means the change over time of your prediction, and that change is the error for your prediction.
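The idea Rich describes can be sketched as tabular TD(0) prediction. This is an illustrative toy, not code from the interview: the state names echo his bar-pressing example, and the step size, discount, and reward values are assumptions of this sketch:

```python
# Tabular TD(0): nudge the value of a state toward the better-informed
# estimate "reward received + value of the state that followed".
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] toward r + gamma * V[s_next]; return the TD error."""
    td_error = r + gamma * V[s_next] - V[s]  # change in prediction = learning signal
    V[s] += alpha * td_error
    return td_error

# A three-state chain: cue -> sound -> food. Reward (1.0) arrives only on
# reaching 'food', yet with repeated experience the earlier states come to
# predict it, which is the secondary-reinforcement effect Rich mentions.
V = {"cue": 0.0, "sound": 0.0, "food": 0.0}
for _ in range(200):
    td0_update(V, "cue", 0.0, "sound")   # no reward yet, but sound predicts it
    td0_update(V, "sound", 1.0, "food")  # reward delivered on reaching food

print(V["cue"], V["sound"])  # both approach 1: the early cue itself predicts reward
```

The update never waits for the end of an episode: each step learns from the difference between successive predictions, which is exactly the "change over time of your prediction" in the passage above.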

00:12:17 Tom Mitchell: Yeah, I think that was a great contribution and a great...

00:12:24 Rich Sutton: Step.

00:12:25 Tom Mitchell: ...piece of research advance. So, a little more broadly then, what do you think are the successes of reinforcement learning in practice? As you were working in this area, what was the first time you thought: oh, that's a success, we finally have evidence that this is not just a nice theory but something that could be practical?

00:12:59 Rich Sutton: Well, the first application success was TD-Gammon, where Gerry Tesauro used the TD(lambda) algorithm to play world-champion-level backgammon in the nineteen nineties, in nineteen ninety-two, when the paper came out. And then that was done again on a larger scale with AlphaGo, and really perfected with AlphaZero, when AlphaGo beat the world's best Go players and AlphaZero did that without any supervised learning. The first system learned from human examples of good play, and then AlphaZero was able to learn entirely by self-play. Those are the convincing ones. Well, also GT Sophy was good, and there were lots of things. It's funny, though: none of these things involved me. In a sense, I've never done anything useful in my whole life; I never did anything that's directly useful. I always think of people who do all this theory, become scientists, and then say, well, now is my time to give back. And I realized I've never given back. I've never done anything that's actually useful.

00:14:35 Tom Mitchell: And I suspect your contributions are not in question here, so I wouldn't worry about it.

00:14:40 Rich Sutton: I'm not apologizing for it. I think it's good that we can celebrate that people can do purely fundamental things and still be deserving of recognition.

00:14:52 Tom Mitchell: That's great. I'm with you on that; that's a great world to live in. Okay. So, you've been part of the broader machine learning community for decades, and you've seen tremendous change during those decades. What are some of the biggest surprises in the field to you over the years?

00:15:20 Rich Sutton: Well, the large language models are, I think, a big phenomenon. In some sense, it's the final culmination, I want to say the victory, of connectionist methods over symbolic methods. Because what is the classic symbolic thing? It's language. And now we find that, oh, these networks can do language extraordinarily well, better than all the other, symbolic, efforts to understand language. So that is surprising, and it seems like a total victory for the connectionists, which is what I like to call them. I don't like calling these things neural networks, because they're not neural networks. Neural networks are networks that are neural; they're in our heads, not in our computers. At least, please, say for me: artificial neural networks. In the old days we called them connectionist systems, and I remember deliberately deciding: we could call them neural networks, we could call them connectionist, and the right word is connectionist, because they're full of connections and they're statistical and they're a network, but they're not neural. Come on, they're not neural at all. So anyway, the answer to your question is, I think, how thoroughly we can do language with non-symbolic methods.

00:17:04 Tom Mitchell: You know, I have a related one, which is the victory of natural language as a representation over logical formalisms, and I think it's closely related to what you're talking about. For many years, AI, and machine learning along with it, went with the idea that symbolic, logical representations of information, of knowledge, would be the way to go. One interesting fallout of the success of LLMs is a working demonstration of just how effective natural language itself is as a representation, in contrast to logical formalisms, which have a smaller breadth of scope, to put it one way. I can't quite tease apart those two issues, but you're bringing up the victory of connectionist systems, and it kind of seems to go hand in hand with that. Just a thought.

00:18:19 Rich Sutton: Yeah, I think they do.

00:18:22 Rich Sutton: Let me bring up another one that has surprised me, which is that learning, supervised learning, hasn't progressed as far as I would have expected. From the earliest days, you and I knew that what's really important is the formation of appropriate representations. The reason people can learn fast is that they have appropriate representations, so when they get new data they know just what to assign credit to, what changes to make and what changes not to make. This was always quite clear and overtly stated in the old days: we need to figure out how to learn representations. And then we never did. I think we never did, and we still don't know how. Deep learning doesn't learn representations. There was a moment when we thought backpropagation would enable you to learn representations, but that hasn't played out. It's forty years later, or something, and we still don't know how to learn good representations; we don't know how to learn fast in a continual way. And nowadays we don't even talk about it. We can't do it, and so we just stopped talking about it.

00:20:02 Tom Mitchell: I see.

00:20:02 Tom Mitchell: So let me push on that because, um, that surprises me a little bit that you say that. Um, so when you say neural networks, we thought that backpropagation would learn representations, but that didn't play out. And you say what you mean by that. And let me preface it with an example. Suppose I have a twelve layer neural network that's looking at images. And in the very final layer there's some softmax linear Thing that's, um, uh, going to make a classification. So at that penultimate layer, at the next to last layer, there is some representation of transformed through these layers and layers of the image that somehow manages to now have the property. Then this final representation that a linear decision surface can make the classification. So do you not see that as representation learning or do you think are you thinking something else?

00:21:15 Rich Sutton: So the networks do find a representation that is sufficient to capture the training set, but they don't find a representation that's good for learning.

00:21:31 Tom Mitchell: What would be a good... Well, I guess you...

00:21:33 Rich Sutton: You know, the test of whether your representation is good for learning is that when you get new data, you make the appropriate adjustments. What do we know about deep learning? That if you get new data, you will catastrophically forget all the old things, and you also will not be able to learn new things. The task of backprop is to keep changing the weights until you find a way to get your training set all correct. So it will find a solution, but it won't find something that changes well. And if you remember, in the old days, going back even to Minsky's "Steps Toward Artificial Intelligence" paper, it was all about finding the representation so that if I give you a new example, you attribute the lesson to the appropriate dimension rather than the extraneous dimension. And we just can't do this at all. Yeah.
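The failure Rich describes can be shown in miniature: fit a model on one task, then continue training the same weights on a conflicting task, and the first task's solution is destroyed. This is a purely illustrative one-weight linear model; the tasks, learning rate, and data are made up for this sketch:

```python
# Plain SGD on squared error for the one-parameter model y = w * x.
def sgd_fit(w, data, lr=0.1, epochs=50):
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # gradient of (w*x - y)^2 w.r.t. w
            w -= lr * grad
    return w

def loss(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(1.0, 2.0), (2.0, 4.0)]    # consistent with y = 2x
task_b = [(1.0, -2.0), (2.0, -4.0)]  # conflicting: consistent with y = -2x

w = sgd_fit(0.0, task_a)          # learn task A: w ends up near 2
loss_a_before = loss(w, task_a)   # near zero
w = sgd_fit(w, task_b)            # keep training the same weight on task B
loss_a_after = loss(w, task_a)    # task A performance is destroyed

print(loss_a_before, loss_a_after)
```

With a single shared parameter the effect is extreme, but the same mechanism operates in large networks: nothing in plain gradient descent protects the weights that encoded the earlier task.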

00:22:45 Tom Mitchell: Yeah, you're right. I remember some very early examples, like how to see the chessboard in a way that makes certain questions very easily answered. Early work, also by Amarel, on how to see a new problem with a representation that makes the problem simple. And I think that kind of short-term learning of the right representation is something I'd agree is very much an open problem in the field. Speaking of open problems in the field, since I have you here: what's your picture of the state of the art? In particular, are there any good PhD thesis topics that people should be aware of these days?

00:23:36 Rich Sutton: Well, I think this is one: how do you learn representations that are good for learning? That's number one; I'm having my PhD students work on that. Number two is planning with a learned model. We have all kinds of good planning systems, like AlphaGo and the chess programs and many others, but none of those will work with a learned model. How can we represent the knowledge you need in the model of the world in such a way that you can plan well with it? Those are the two most important ones, I think.

00:24:23 Tom Mitchell: Yeah, those are very fundamental. There are probably multiple PhD theses to be had in each one of those.

00:24:30 Rich Sutton: I'm very much in favor of figuring out the hard problems, the ones we don't know how to do, and then doing them, doing the really hard ones. The opposite attitude says you should avoid the things that are really hard, that people have been trying forever to do and have failed at. I think we should work on those; I think the PhD students should do those. It's the mark of a scientist: you identify what we don't know, and you work on it, as opposed to saying, well, there are a lot of things we don't know, but this thing we do know, so let's emphasize what we do know, let's show it in a big demonstration, or make a startup about it and make a lot of money. The gaps, the things we don't understand and can't do, are what I think we have to focus on. I guess I'm getting impatient. I want to understand how learning works, how the mind works, and we've got these gaps; we've got to fill them in.

00:25:39 Tom Mitchell: Speaking of how the mind works: one of the surprises to me about the field of AI is that despite the fact that it's been around for fifty years or more, it still has very little connection to the study of intelligence in neuroscience and in animals. In fact, your reinforcement learning work is one of the shining examples where a whole line of research is jointly motivated by psychology as well as computation. But what do you think are the...

00:26:19 Rich Sutton: Sorry, for me they are one. You know, I always talk about a science of mind that is neither engineering nor a natural science. Psychology is a natural science; they're studying minds. We would like to study intelligence as something separate from us. It's not just "let's make money," and it's not "let's understand nature." It's: let's understand this phenomenon of mind. There are natural minds, and they can inform us, and then of course we want to use what we learn. For me, psychology and reinforcement learning are fully connected; they've always been fully connected. But you're right: the field of machine learning as a whole has only sometimes touched on neuroscience, and when it has touched, it hasn't always been useful. I used to say the danger of being interdisciplinary is that you fall between the disciplines and you end up with no discipline.

00:27:46 Tom Mitchell: Maybe a final question here; I know we're short on time. Looking forward, granted that nobody can predict the future perfectly: AI, machine learning, and computer science are all going through tremendous change right now. What's your guess about what comes out on the other side, and how this evolves? Do you have any thoughts about that, or advice to students who are starting a career in this area now and who might be nervous about it?

00:28:30 Rich Sutton: Well, I prefer the model that this is just the normal progression of an important science: understanding how minds can work, and thus a bit about how our own minds work. This is obviously an important thing. Humanity has always wondered about how minds work, what their role is, and what they are. So this is a basic science question, a basic humanities question, and I think it will continue. The strange thing about our field is that it's an enormous, economically important industry, literally a trillion-dollar industry. So alongside our trying to do science, we have this hurricane of funding and money and economic impact that warps everything, changes it from being a pure science activity. We're all familiar with that; it distorts things and introduces lots of hype. Even the terminology is warped by whatever terms the industry has chosen to use, and we can't help but use them. They call something "inference," and you can try to be a holdout and stop calling it inference, because it clearly shouldn't be called inference, or "attention," or "reasoning." You can try to hold out, but you're just going to be not understood, probably. So we have this tsunami alongside our science, and it really warps everything. Now, what will happen and how will this develop? I don't know. The tsunami may continue getting bigger and bigger, or maybe the hype will lead to disappointment, and there'll be a cooling period and another winter, which might even be good. Who knows? But most likely it will continue being economically valuable, and there will continue to be a huge industry growing alongside the science.
And there'll be some science that will persist, and why not? There'll be lots of funding, so there should be funding for fundamental science as well. And then we'll succeed: we will understand how learning can work, how intelligence works, and how our minds work, to some extent. And we will make artificial intelligence systems that will rival humans and then surpass humans. Now, I like to think this is all good, that we just have improved technology, an improved ability to do things: the more intelligence there is in the world, the better the world will be. I think that will be the bottom line, even though right now there's a tremendous amount of fear about intelligence, which I think is unnecessary, and counterproductive, because fear is not a good attitude for dealing with change. On the other hand, the fear causes people to pay attention. It's so easy to just say: those scientists are doing something, and I don't need to pay attention. I think it's good for the public to be paying attention to the changes that are happening in AI, and to the different understanding we're getting of ourselves. So I wish it could be done without fear, but maybe we need the element of fear to get people to pay attention. I think it's quite likely that there's going to be something like discrimination against AIs. We'll be saying, oh, there's something special about natural intelligence, and we want to be superior, because people always want to be superior in the face of change. So there could be something like discrimination against AIs, and we will say they don't really feel pain and they don't really have goals.
And it's okay for us to have them be subservient to us, the way machines are now. There could be something like discrimination, and then that will have to be overcome, and eventually we'll decide: yeah, machines are minds just like ourselves, and we should accept them and stop discriminating against them. That, maybe, is how things will evolve over the longer-term future.

00:33:46 Tom Mitchell: All right. Thank you, Rich. Rich Sutton, thank you so much for your time and your thoughts; plenty to think about there. We really appreciate not only your contributions so far, but your ongoing work and your thoughts and efforts to steer the community.

00:34:07 Rich Sutton: Thank you, Tom. It's a pleasure.

00:34:10 Matty Smith: Tom Mitchell is the Founders University Professor at Carnegie Mellon University. Machine Learning: How Did We Get Here? is produced by the Stanford Digital Economy Lab. If you enjoyed this episode, subscribe wherever you listen to podcasts.