Tom Mitchell Welcome to Machine Learning: How Did We Get Here? I'm your podcast host, Tom Mitchell. Today's episode is an interview with Yann LeCun, one of the pioneers in the field of neural nets. Yann was born in France, first became widely known, I think, for his work applying neural networks to image data, and has done many, many things since then. He is a member of the US National Academy of Sciences and the National Academy of Engineering, and in France he has been appointed a knight, that is, a Chevalier of the French Legion of Honor. I hope you enjoy the episode. Today, I'm happy to say I have with me Yann LeCun, one of the pioneers in the field of machine learning. Yann, great to have you here.
Yann LeCun Thanks for having me.
Tom Mitchell So what I'd like to get out of our conversation today is basically two things. One is your picture of the history of machine learning, how this field has evolved, but also, equally important, your own personal history. What got you into this? How do you perceive it? What do you think were the important events in your own career? So let me start there just by asking: how on earth did you get into this field? What motivated you? How did you end up even looking in this direction?
Yann LeCun Well, I was always fascinated by the question of intelligence, whether it's artificial or natural. When I was a high school student or engineering student, I studied electrical engineering. And then one day I stumbled on a book, which was the transcription of a debate between Noam Chomsky and Jean Piaget, the developmental psychologist. It was basically a nature-nurture debate about language, whether language is learned or innate. Chomsky and Piaget each brought their teams of supporters, if you want, to argue for their side. Chomsky, of course, was arguing for the innateness of language. Piaget said, well, there is some structure, but it's mostly learned. And he brought with him Seymour Papert from MIT. Seymour Papert had done some sort of postdoc or had worked with Piaget, because he was also fascinated by the question of learning, and then moved on to MIT and worked with Minsky and everything. And he had this whole transcribed talk about the perceptron, the model from the nineteen fifties that was one of the first models really capable of learning. I read this and was fascinated by the idea that people had worked on learning machines. I thought that was a fascinating idea. So as an undergrad I started digging through the literature on this. This was before the internet, right? We're talking nineteen eighty, essentially. We didn't have class on Wednesday afternoons, so I would spend them at the INRIA library, where they had all kinds of publications, and dug through the literature and realized nobody was working on this anymore.
In the early nineteen eighties, the whole field was dead; it had been killed in the late sixties by a book that Seymour Papert co-authored. But there he was, ten years later, basically praising the perceptron as capable of surprisingly complex learning abilities. So I found that kind of odd. The only things I could find that were more or less contemporary at the time were obscure Japanese publications on neural nets, basically. Japan had a kind of separate research ecosystem, and a bunch of people kept working on neural nets and machine learning in Japan, whereas everybody in the West had basically stopped. So that's how I got interested in this. And I started doing a couple of independent study projects with some professors at my engineering school on neural nets and perceptrons and machine learning, and what we would now call computational neuroscience, things like that, and then decided that this was too fascinating for me not to pursue. I decided to do graduate school on this, but my problem was nobody was working on machine learning in France. In fact, there were a few people, but they were working on symbolic machine learning. You probably know them from back then; it was Yves Kodratoff in Orsay, right, and there were a few other people who were doing mostly symbolic machine learning. But I was more interested in the neural net type of approach. I had some sort of fellowship from my engineering school. The engineering school had computers, they gave me an office, so I didn't need any money, but I needed someone to basically be my official advisor and sign the papers.
And I found someone who was not working on neural nets or machine learning at all, but who said, you look smart enough, I'll sign the papers for you. So that was my PhD advisor, Maurice Milgram, a very nice gentleman. He actually started working on neural nets after I graduated, four years later. And I managed to find a community of people in France who were working on what they called automata networks. They were interested in things like self-organization, stuff like that. They were not really connected with neural nets, but it sort of became connected, and they connected me with the very small community around the world that was restarting interest in neural nets. We're talking nineteen eighty three. That's where I discovered John Hopfield, and Geoff Hinton and Terry Sejnowski with the Boltzmann machines. Terry and Geoff were the two people in the world I really wanted to meet, and I met Terry early in nineteen eighty five and Geoff late in nineteen eighty five. They both came to a conference, a workshop in France, and when Geoff and I met, we realized we were so in tune, so aligned, that we could finish each other's sentences. That's when I felt like I became part of a kind of bigger community, which was still extremely small at the time, and in nineteen eighty six Geoff invited me to the first Connectionist Summer School at Carnegie Mellon. This was my first experience on an American campus. Actually, I'd been to Stanford a couple of years before, but that summer school was really the start of the so-called connectionist community. And I met Michael Jordan there, and we became fast friends because my English was terrible and he could speak French.
So that's how I got into it. And pretty early on, by reading all the literature from the sixties, I figured out that what people were after was essentially a learning procedure that could train multi-layer neural nets. People were already looking for this, but nobody had really found one that worked, mostly because the type of neurons people were using at the time were binary. And if you have binary neurons, you don't have the idea of using gradient descent or anything like this. But around nineteen eighty two, before I had even graduated from engineering school, I figured there had to be some way of training those multilayer neural nets by propagating some signal backwards. And so I came up with an algorithm. It was a very intuitive thing that I think we would now call target prop, but you could think of it as an early version of backprop. I wrote a paper about it after the first year of my graduate school and published it at a conference, where Geoff saw it. My paper was written in French, but he could read the math, and he figured it was kind of like backprop. He was working on backprop at the time; he hadn't published anything yet. So that's a bit of the history there.
Tom Mitchell That's amazing. So then, around that time, I guess the PDP book, the Parallel Distributed Processing book, which was a compendium of neural network papers, appeared, around nineteen eighty six if I recall right. That was kind of the explosion of interest.
Yann LeCun Right. So at this summer school at CMU, Jay McClelland was circulating photocopied chapters of that book, but the book hadn't appeared yet. Nothing had been published on backprop yet, but that was the thing people were talking about at the summer school, essentially. And I gave a talk on my version of backprop at the summer school. I was a student, but Geoff gave me a slot to talk about my thing.
Tom Mitchell In retrospect, I was always sorry I missed that meeting, because nineteen eighty six was actually the year that I moved to CMU. Oh, wow. And Geoff and Allen Newell and I team-taught a course on architectures for intelligence. Somehow I was out of town, or not smart enough, to get myself over there. But I love that story. So at that point there was other work in machine learning too: work on decision trees, various things. Did you look at that? Was that of interest, or was it pretty clear to you that neural networks were the direction you wanted to take?
Yann LeCun Yeah, it was pretty clear to me neural nets were the way to go. But my Bible when I was starting graduate studies was the Duda and Hart book, at least the first part of it, which was all about statistical pattern recognition, which basically was the heir of the perceptron. In the late sixties, when the claim that perceptrons were a path to AI died, the people who had been working on this basically just changed the name of what they were doing. They kept working on the same stuff; they just changed the name. So one entire community, like Bernie Widrow at Stanford, renamed the whole field adaptive filters. Adaptive filters had a huge impact on modern technology, right? We wouldn't have modems or the internet or mobile communication or anything without this. And the underlying algorithm is basically the same, not the perceptron but the Adaline. So that became one branch, and the other branch was renamed statistical pattern recognition, which sounds much more serious than perceptrons or AI or machine learning, whatever. Machine learning was not really a field. I mean, it was, but it was small. You were in it, so you know that better than me. And then there were methods like what was not yet called kernel methods; there is something in Duda and Hart about this. There were linear classifiers, of course, a lot of work on this, polynomial classifiers, things like that.
I found all of that fascinating. And then there was relatively separate work on things like classification trees, which I think was a bit of a separate community at the time. What was fashionable at the time in pattern recognition was called structural pattern recognition: basically encoding patterns, whether an image or something else, into a sequence of symbols and then building grammars, so that if the sequence of symbols is correctly parsed by a grammar, you've recognized a particular pattern. It was, to some extent, a giant failure, but it was fashionable at the time, so that's what most of the papers were on. But ultimately I thought the most interesting thing to work on was not just training classifiers, which is what machine learning really was at the time, or pattern recognition, but learning the representation, learning the features, which is what multi-layer neural nets can do. And so I thought, really, we need to get this to work, to train multilayer neural nets. And backprop ended up being a better solution than we thought at the time, actually.
Tom Mitchell Right. So then, going forward, thinking about your own career and your own contributions in that area, take us forward from that point in the mid eighties. You did some very well regarded work in the late eighties and early nineties. Tell us about that.
Yann LeCun Right. So I was really inspired by neuroscience and by some classic work I had read, like Hubel and Wiesel's work on the architecture of the visual cortex. And there were models, mostly developed in Japan, like Fukushima's Neocognitron, which basically tried to mimic the architecture of the visual cortex and build a kind of artificial neural net to do this. But Fukushima didn't have backprop, so he couldn't really train his system end to end. So when backprop became a thing, I was finishing my PhD, basically, and I said, this is what I want to build. In fact, the system I built during my PhD had multiple layers with local connections. It was not convolutional, there were no shared weights, but it had local connections, and I was trying to mimic a little bit the Hubel and Wiesel type of architecture, using my pseudo-backprop, which actually propagates virtual targets, not gradients, to train it. And it sort of worked, ish. Okay. Then backprop became the correct thing to do. And so I said, okay, what I need to build now is basically a multi-layer neural net with local connections, and with this idea that you replicate the filters all over, because images are shift invariant, basically. But back in those days, it's not like you had Python and PyTorch and some matrix engine; you had to program everything from scratch. So, as I was about to finish my PhD, I said, I'm going to write a neural net simulator that has enough flexibility so I can build this.
And I got together with a young student who was finishing his undergraduate studies, called Leon, a pretty famous guy now. He came to me and said, I need to do a project for my studies, and I want to work on neural nets, and you are the only person in France who's working on this, so can I work with you? So I said, sure. We worked together for about six months, my last semester before graduating, and we wrote a neural net simulator that had all the nice bells and whistles. Then, mid nineteen eighty seven, I moved to Toronto to do a postdoc with Geoff, and I completed this simulator. Geoff thought I was not doing anything because I was just basically hacking all the time. And this system was interesting because we had to build a front-end language to interact with it, and that language was a Lisp interpreter that Leon and I wrote. So we were using Lisp as a front end to a neural net simulator. I implemented the weight-sharing abilities and all that stuff and started experimenting with what became convolutional nets. This was when I was a postdoc in Toronto, early nineteen eighty eight, roughly, and I started to get really good results on very simple shape recognition, like handwritten characters that I had drawn with my mouse or something like that. And then I was recruited by Bell Labs. So I got to Bell Labs in New Jersey in late nineteen eighty eight, in October, and I already had all the code for convolutional nets and simple experiments with small data sets. The group I came into was directed by Larry Jackel. They had a huge dataset of zip code digits that they got from the University of Buffalo or somewhere. And it was huge in the sense that it had seven thousand training samples.
I mean, seven thousand samples, five thousand training samples, that order of magnitude. So I said, wow. And also they gave me a big computer. So I started turning the crank on convolutional nets and, within two months, got super good results that were beating whatever results other people had obtained on this data set. So that was a quick success. We wrote a paper for Neural Computation, improved the architecture, and had a paper at NeurIPS, I mean, it was still called NIPS at the time, in nineteen eighty nine. And I think that work attracted quite a bit of attention, in particular from some development groups at Bell Labs who said, we can use this for products. So they started partnering with us, with the research group, to basically develop products that could be used for reading faxed forms, this was before the internet, right, and for reading zip codes, and reading checks, reading the amounts on checks. That became commercial products eventually in the mid nineties. So that was kind of exciting.
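The weight-sharing idea described above, one small filter replicated over every location so a shape is detected the same way wherever it is drawn, can be sketched in a few lines. This is a generic toy convolution, not the actual LeNet code; the image and filter below are invented for illustration.

```python
import numpy as np

# Toy illustration of weight sharing: one small filter is slid over every
# position of the image, so the same pattern detector fires wherever the
# pattern happens to be drawn.

def conv2d(img, filt):
    """Plain 'valid' 2D correlation with a single shared filter."""
    h, w = img.shape
    fh, fw = filt.shape
    out = np.zeros((h - fh + 1, w - fw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + fh, j:j + fw] * filt)
    return out

img = np.zeros((8, 8))
img[2:5, 2:5] = 1.0            # a small 3x3 square "shape"
filt = np.ones((3, 3)) / 9.0   # one shared 3x3 filter, used everywhere

shifted = np.roll(img, 2, axis=1)  # the same shape, drawn 2 pixels right

# The detector's peak response moves with the shape: same weights, new place.
loc_a = np.unravel_index(conv2d(img, filt).argmax(), (6, 6))
loc_b = np.unravel_index(conv2d(shifted, filt).argmax(), (6, 6))
print(loc_a, loc_b)  # prints (2, 2) (2, 4): the peak follows the shape
```

Stacking layers of such shared filters, and learning the filter values by backprop, is the part the simulator had to provide; the sharing itself is what buys tolerance to where the character sits.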
Tom Mitchell That's pretty cool. So convolutional nets, of course, are the idea that, among other things, you can look at different locations and different scales for patterns in the image. And that was a breakthrough, really, because up to that point, if you just treat the raw pixels of the image of a character, for example, and try to train a classifier, then if somebody happens to draw the character at a small scale or a large scale, or at a different location on the screen, you have to somehow compensate. Convolutional nets turned out to be the solution of choice for overcoming that problem. You did some other work at the time that I still remember, that I thought was really innovative, around something you called tangent prop. Can you recall that, or is that too obscure for this discussion?
Yann LeCun No, it's not too obscure. Tangent prop, yeah. A lot of the ideas there were not mine; they came from Patrice Simard. Patrice Simard was initially a postdoc at Bell Labs and then became a research scientist. And he had this idea that you could regularize neural nets by telling them: not only do you need to produce this output for this particular input, but you also need this output to not change whenever I modify the input in a particular way. So if I take, let's say, the image of a character, say a digit, and I distort it a little bit, say I rotate it a little bit, or translate it a little bit, or distort it in some way, I want the output to basically remain invariant. So basically I want the derivative of the input-output function of this network to be zero in certain directions, the directions that don't affect the nature of the input. We implemented this, and it turned out to improve the performance of a convolutional net a little bit. But even more, if you didn't have a structured neural net, if you had a fully connected neural net that didn't have any a priori structure, it would improve the performance of those quite a lot. And then there was a follow-up paper called tangent distance, which was a way of comparing shapes by making the comparison invariant to small transformations. It was a very cool idea, but I can't claim credit for it.
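The regularizer described here, driving the derivative of the network's input-output function to zero along transformation directions, can be sketched roughly as follows. This uses a finite-difference approximation rather than the analytic tangent computation of the actual paper, and the tiny network and the shift-based tangent vector are my own invented stand-ins.

```python
import numpy as np

# Rough sketch of a tangent-prop-style regularizer: estimate the directional
# derivative of the network output along a transformation direction (here, a
# one-pixel horizontal shift) and penalize its squared norm, so the output
# becomes locally invariant to that transformation.

rng = np.random.default_rng(0)

def net(x, W1, W2):
    """Tiny fully connected net: flattened image -> 10 class scores."""
    return np.tanh(x @ W1) @ W2

img = rng.standard_normal((16, 16))      # stand-in for a digit image
x = img.ravel()
W1 = 0.1 * rng.standard_normal((256, 32))
W2 = 0.1 * rng.standard_normal((32, 10))

# Tangent vector: the direction the input moves in under a small shift.
t = (np.roll(img, 1, axis=1) - img).ravel()

# Directional derivative of the net along t, by finite differences.
eps = 1e-4
dir_deriv = (net(x + eps * t, W1, W2) - net(x, W1, W2)) / eps

# The regularizer added to the usual training loss: push this toward zero.
penalty = np.sum(dir_deriv ** 2)
print(penalty)
```

In training you would add a weighted multiple of this penalty to the classification loss and backpropagate through both terms; the a priori structure of a convolutional net already encodes some of this invariance, which matches the observation that the gain was larger for fully connected nets.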
Tom Mitchell Okay. So then, now we're up to maybe the early nineteen nineties. There were other things going on in the field of machine learning. You mentioned earlier Michael Jordan; Michael and Judea Pearl and some other people sort of turned their attention to what they were calling Bayes nets or Bayesian graphical models, probabilistic approaches that were kind of symbolic, but that would learn, or represent and use, probabilistic dependencies among different variables. That became a very popular approach in the field. And I recall that in some of the talks, people would be careful to say not that they were working on Bayesian methods, but that they were working on principled Bayesian methods, just to underscore the difference between engineered gradients and principled probabilistic approaches. What did that look like to you as it was happening?
Yann LeCun Well, it looked to me like, I mean, those were interesting methods for their own sake. In their original form, before machine learning people like Mike Jordan got interested in them, there was no learning whatsoever. There were people like David Heckerman and others doing the whole theory of inference for this, but basically those networks had to come from somewhere; you had to build them by hand, more or less. The idea that you could essentially train them was due to the collision between the two communities, the Bayesian net community and the machine learning community. And I remember exactly when and where that happened. It happened at the Snowbird workshop. There was this workshop that took place in Snowbird every year since nineteen eighty six, I think, and I started going in nineteen eighty seven. It was mostly run by my department head from Bell Labs, Larry Jackel. And I think Mike Jordan was one of the advisors or committee members or whatever, and he recommended that we bring in David Heckerman and Lauritzen and Spiegelhalter, all of those people who were working on Bayesian nets, to the workshop, to educate the community about those methods, basically. And I think that was quite impactful. I mean, I thought it was a little too simplistic in terms of learning. What eventually became trainable graphical models were still shallow networks, essentially: you would learn a few weights for something like logistic regression, and that was kind of all there was to it.
So in my opinion, it had a usefulness, a lot of theoretical insights, but in terms of what I was interested in, which was learning representations of things and doing complex tasks with machine learning, I think it was kind of a tangent. So my interest was limited. But a lot of really interesting stuff was developed in this context, like variational inference and things of that type. The other thing that was taking place, and that basically pushed neural nets out of the center of interest of the machine learning community, was kernel methods, support vector machines and things of that type. That was happening in the same group where I was at Bell Labs, because my colleagues included Vladimir Vapnik and Corinna Cortes and Isabelle Guyon and Bernhard Boser; these are the people who wrote the first papers on support vector machines, and who also worked on convolutional nets. And I thought this was a really cool idea; the theory was very cute. But again, I thought it was a tangent, because I don't think it was solving the problem I was interested in, that all of us were interested in, of learning representations. But those two things, Bayesian nets and kernel methods, kind of took over the field of machine learning and basically pushed neural nets out, to the point of being a topic of mockery, essentially. Which is really not deserved, because neural nets really worked very well. The performance was there. It's just this curious phenomenon that, because the theory of neural nets was not satisfactory, essentially, people stopped working on them in favor of methods where the theory was easier.
And I think that's completely wrong methodologically: to work only on a particular engineering artifact because you can understand it theoretically, while ignoring the fact that there is another way to do it that works better, and dismissing that because your theory doesn't apply to it. I think that's just misdirected, but that's basically what happened.
Tom Mitchell Take us through the decades that followed, in terms of what you think were the main milestones in the development of neural networks, which came back in a big way, of course, in more recent years. Take us through the developments there.
Yann LeCun Okay. So what I was working on in the early nineties, up to nineteen ninety six roughly, was what we now call structured prediction. Imagine the input to your learning system is not a single object but a compound object, like a handwritten word or a spoken word. You can use a neural net to turn it into a sequence of feature vectors, representation vectors, maybe hypotheses about the categories of each of the objects underneath. In a speech recognition system, for example, you take a window of the signal that you shift over time, and for each window you have a list of scores for whether the sound in this window is a particular phone, as they're called, which is a basic unit of sound, essentially. But then you have the problem, and for handwritten words it's the same thing, that you never know exactly where the characters begin and end, or how to segment them. So you can make hypotheses about what the characters are, and then run them through your favorite neural net, and it's going to tell you, well, this could be a four or it could be a one or whatever. But then you have to figure out, among this entire sequence of hypotheses, which one is the most likely to be correct. And for that you have to use a language model. You have to say, well, okay, this letter is very likely to be a Q, it looks very much like a Q, so probably the next one is a U, because in English most Q's are followed by U, etc. The classical way people were doing this in speech recognition was through dynamic programming.
You search for a shortest path in a graph, where the graph represents the possible ways of segmenting the sequence, and each transition in the graph has a score that indicates whether it's a particular category or not. So that part is easy; but how do you train a system like this end to end, when you show it a word and tell it, here is the transcription of this word, the sequence of characters you need to put out, but I can't tell you where the characters are? People have come to call this structured prediction. Leon and I, as well as Yoshua Bengio and Patrick Haffner, basically devised a set of techniques to handle this, which I think were quite general, to train a system like this end to end. And this ended up being what was deployed commercially to read checks by our colleagues in the engineering organization. I was really very proud of that work. The technical part was basically completed in nineteen ninety six, and then it took us two years to write the paper, so the paper came out in nineteen ninety eight. But by that time, everybody in the community had lost interest in neural nets. This was the heyday of the internet, and I was promoted to department head as interests shifted, so I took over the group and had to decide what to do with my new research group, and decided not to work on neural nets and machine learning anymore. We worked on a project called DjVu, which was essentially an image compression technique to bring printed material to the internet: we could compress high resolution scanned documents to an incredibly small size, with low memory occupancy, to distribute documents over the internet. And that had some success.
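The decoding step described above, per-position recognizer scores combined with a language model through dynamic programming, can be sketched with a toy Viterbi search. The three-character alphabet and every number below are invented for illustration; this is the generic shortest-path idea, not the actual check-reading system.

```python
import numpy as np

# Toy decoding: each position in a word gets per-character scores from a
# recognizer, a bigram language model scores character transitions, and
# dynamic programming (Viterbi) finds the jointly best character sequence.

chars = ["q", "u", "o"]

# Recognizer log-scores: rows = positions in the word, cols = characters.
emit = np.log(np.array([
    [0.70, 0.10, 0.20],   # position 0: looks very much like a 'q'
    [0.30, 0.35, 0.35],   # position 1: ambiguous on its own
]))

# Bigram log-probs: trans[i, j] = log P(next = chars[j] | prev = chars[i]).
trans = np.log(np.array([
    [0.05, 0.90, 0.05],   # after a 'q', a 'u' is very likely
    [0.40, 0.20, 0.40],
    [0.40, 0.20, 0.40],
]))

def viterbi(emit, trans):
    """Best character sequence under recognizer + language model scores."""
    n, k = emit.shape
    score = emit[0].copy()                 # best score ending in each char
    back = np.zeros((n, k), dtype=int)     # backpointers for the best path
    for pos in range(1, n):
        cand = score[:, None] + trans + emit[pos][None, :]
        back[pos] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]           # trace the best path backwards
    for pos in range(n - 1, 0, -1):
        path.append(int(back[pos][path[-1]]))
    return [chars[i] for i in reversed(path)]

print(viterbi(emit, trans))  # prints ['q', 'u']: the language model breaks the tie
```

The end-to-end training problem is then to backpropagate through this best-path selection so the recognizer and the transition scores are trained jointly, without ever labeling where each character starts or ends.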
So I worked on this mostly for about five years, between nineteen ninety five and two thousand, or nineteen ninety six and two thousand one. This was also when AT&T split itself up, and Bell Labs was split too. I stayed with AT&T; the lab was renamed AT&T Labs, and the interests were different. The part of the company that was commercializing those check reading systems for banks was spun off, so we were cut off from the product outlet, if you want. The development group that had developed all those character recognition engines went with Lucent Technologies and stayed with Bell Labs, and we went with AT&T. So the entire project was dissolved, basically, in nineteen ninety six. And I was kind of depressed for a while, because the thing I was super proud of, I could not work on anymore, essentially. What's even funnier is that in nineteen ninety we had filed a patent on convolutional nets; there are two patents, actually, on convolutional nets. And in the split-up, that patent was assigned to NCR, which was the subsidiary that was building those machines. But the funny thing is, nobody at NCR had any idea what a convolutional net was, and they owned the patent. We were at AT&T, and basically another company owned a piece of our brain, essentially. That was really depressing. So I became a bit of an anti-patent activist afterwards. I think it's hurtful, because it stops people from working on their own ideas, essentially. So that was disappointing. But those patents expired in two thousand seven, because at the time the validity of a patent was seventeen years.
I popped the champagne when that happened, but by then I was at NYU. I had restarted working on neural nets in two thousand two; I left AT&T in early two thousand two with the idea of restarting a research program on neural nets, mostly for the purpose of applying them to computer vision, becoming connected with the computer vision community. I thought there was a lot to be done there to bring machine learning to computer vision. So that's what I started. I joined the NEC Research Institute in Princeton, but only for eighteen months, because the place was not going well, and then became a professor at NYU in two thousand three. And that's when Geoff came back to Canada. He had been in the UK for a few years, came back to Canada, and started a program funded by CIFAR, the Canadian Institute for Advanced Research, on the whole idea of neural computation and perception and neural nets. So we started rebuilding a small community around neural nets, with the explicit purpose of reviving the interest of the machine learning community in neural nets. It took about ten years, but it succeeded beyond our wildest dreams.
Tom Mitchell Great. I guess the most visible outcome of that was around, uh, the competition at the computer vision conference, um, which was two thousand twelve, when, uh, neural net approaches totally dominated the competition, even though most of the competitors were other kinds of approaches. And within a year or two, the entire conference had kind of flipped over to neural network approaches to vision. Right. So that was, um, the beginning of the period that some people refer to as the deep network period. But yeah, take us there.
Yann LeCun So it started a bit earlier. So, so Geoff, uh, Yoshua Bengio and I, uh, were kind of running a workshop, um, basically the follow-up of the workshops that Yoshua and I had been running, and then with Geoff, we usually had a workshop around NIPS, uh, on what we had rebranded deep learning. Um, so why did we rebrand the whole neural net approach to deep learning? The reason was neural net was again a subject of mockery. And so we thought it was kind of, um, a good idea to basically broaden the scope of it a little bit and call it deep learning. Uh, you know, basically just the idea that you train a complex learning machine with multiple nonlinear steps, right? I mean, it's nothing more than that, nothing more deep than that. Uh, and, uh, we don't remember who kind of came up with the phrase, but, um, but it sort of became a meme. So in two thousand and seven, we proposed a workshop as an official workshop for NIPS. NIPS denied it. The workshop organizers, you know, kind of refused to let us, uh, hold it. So we organized a pirate workshop, um, that was funded by CIFAR. And it was extremely successful. Like, hundreds of people showed up, and it was basically, I think, it marked the real start, or the rebirth, of the deep learning community, if you want. Um, and so it became the point where there were enough people interested in this question that when we would submit a paper to NIPS or ICML or something, it would be reviewed by other people who were also interested in it and knew something about it, which wasn't the case before. Like, the papers were all, you know, mostly rejected, because, like, you know, nobody wanted to hear about neural nets. And that started changing around two thousand and seven or eight. And so we started, like, really building a sort of, you know, increasingly large community around this, and sort of new ideas came up. Um.
And what happened, um, is that all of us were obsessed, uh, so that included groups like Andrew Ng's and things like that, we were all obsessed with, uh, unsupervised learning, essentially, because we thought if we want to scale up those neural nets and train very deep neural nets, what we needed to do was to pre-train them so that, um, you know, each layer, if you want, would do something useful before we could apply backprop. Backprop would be kind of a fine tuning, uh, process, if you want. Um, that was the hypothesis that, uh, Geoff, Yoshua, and I worked on, and Andrew Ng as well. Um, and it turns out that wasn't useful, because you could train a system end to end with deep learning if you use, uh, ReLUs instead of hyperbolic tangents, if you do normalization properly, if you use a few tricks, uh, then you don't need any of this. Um, and so that was a surprise. Um, and, uh, and there were, like, you know, early results in speech recognition. I mean, we had some results also in image recognition, like, for example, on, uh, semantic segmentation of images, where we applied convolutional nets and they worked really well. Um, we actually had a paper that we submitted to CVPR twenty eleven or twenty twelve, I think it was twenty twelve. It was submitted in twenty eleven, but that was for twenty twelve. And, uh, and we were using a convolutional net to basically label every pixel in an image with the category of the object it belongs to. Okay. So that requires having a data set that has been basically labeled at the pixel level, or the region level. And there was a data set that Antonio Torralba had put together that had only, like, a couple thousand images, which was not a lot, but it was enough to kind of train the neural net. And so we had really good results, you know, state of the art. Uh, it was fifty times faster than the best runner up. So I said, oh, great. So I wrote a paper and submitted it to CVPR. The three reviews were, like, incredibly negative.
All three of them. Um, mostly because people had no idea what a convolutional net was in computer vision. Like, some people knew, but, you know, most people had no idea. And so, you know, in the comments on the review, there were things like, oh, why do you learn the features when you can hardwire them? I mean, you know, it's things that seem crazy now, but, um, but that was kind of a particular mindset, right, that people had at the time. Uh, but mostly it was, like, we don't believe that a method we never heard of could work so well, so, you know, there must be something wrong with it. Um, so I actually wrote a letter to the program chair, who was Serge Belongie, and I knew he could do nothing. I had been program chair of CVPR before, so I knew he could do nothing. But I said, like, there's no point for people from machine learning to submit papers to CVPR, because there is nobody to understand what they're doing. So I'm going to tell my students not to bother anymore. Um, so the paper was accepted at ICML, so, you know, it still came out. But, uh, and that was pre, uh, ImageNet, right? Um, but then the real, uh, so there were signs that things were really working well with, uh, with neural nets. But at the end of twenty twelve, the, uh, the fact that the ImageNet competition was won by, um, a convolutional net built by Geoff and Ilya Sutskever and, um, and, uh, Alex, uh, Krizhevsky, uh, was really a bombshell. And, uh, that was, uh, revealed in a workshop at the conference in Florence, I think. And, uh, and it was really funny, because Alex Krizhevsky is the one who presented the work, and, uh, and the room was packed, because everybody had heard that he was getting those amazing results. And he didn't attempt to explain what a convolutional net was. He just assumed everybody knew. Right.
And, like, it was only, like, a handful of people in the room who knew, uh, who had followed my work, and, you know, so people like you, of course, knew exactly what it was, and, uh, a few other people, etc. But for most of the people in the room, it was a UFO, right? I mean, they had no idea where this was coming from. So it was a bit of a shock. And, uh, you know, so in twenty twelve, you couldn't get a paper accepted at CVPR if you used neural nets, and then by twenty fourteen, you couldn't get a paper accepted at CVPR if you did not use neural nets. So.
Tom Mitchell The social aspect of scientific progress.
Yann LeCun Yeah.
Tom Mitchell Very much. Um, very much. Waves of paradigms overlapping and taking over the mindshare of, uh, a community of people. Um. Right. It's interesting, driven partly by technical developments eventually, but maybe in the short term more by social, um, social forces.
Yann LeCun A lot of it is social. Yeah. I mean, it's surprising. Uh, I mean, one lesson I learned from this is it's surprisingly difficult to snap people out of a particular way of thinking. Um, you show them evidence, and they basically dismiss the evidence, uh, you know, until the evidence is incontrovertible. And at that point, you know, you get a phase change, basically. But, uh, but, you know, it's very interesting what the dynamics of this, uh, are. And, you know, no one is to blame for it or anything. Like, you know, this is kind of natural. I mean, you know, the way science progresses is by basically questioning everything, right? Uh, but questioning the common wisdom is very difficult. Like, you know, communities get into very deep local minima, and they have a hard time getting out of them. You see this today with LLMs. LLMs are sucking the air out of the room wherever they are.
Tom Mitchell So it's hard to, um, it's hard to go in the other direction. Although, um, you'll have more to say about that. Take us quickly through the next decade, which got us to where we are today. And then I really do want to hear, um, I know that you've recently, uh, left Meta to strike out in a new direction. I want to hear about that, too.
Yann LeCun Okay. Uh, right. So the next decade was the decade of deep learning, right? So, uh, twenty thirteen, because of the results on ImageNet and others, um, industry started to pay attention. And, um, and so, you know, Google, um, you know, basically hired, uh, Geoff, Alex, and Ilya. Uh, Baidu got into a project on deep learning as well. Uh, and, you know, a few other companies. Uh, IBM got interested in this for speech recognition, and, uh, that actually started earlier, around two thousand and nine and ten. Same with Microsoft, but Microsoft was a bit less aggressive about it. Um, and then, you know, by mid twenty thirteen, uh, Facebook had built a small group to explore the capabilities of this. Uh, by summer, they had hired one of my former students, Marc'Aurelio Ranzato, who was at Google Brain, uh, to kind of, uh, you know, increase the activity around deep learning and neural nets. Uh, and then it became clear that Facebook really wanted to start a significant activity in, uh, what was not yet called AI, um, uh, at least, uh, by everyone. And, um, and, you know, they tried multiple approaches, maybe buying a startup or something like this, or hiring, you know, a bunch of junior people. And then they realized the best way to do it was to hire someone more senior who would be able to attract more junior people. And so that's when they approached me. And the first time was, uh, you know, around summer twenty thirteen. And I told Mark Zuckerberg, I said, well, you know, I can't really help you in a big way. I mean, I can consult with you, but I can't join Facebook, because I'm not going to move to California, and I don't want to quit my job at NYU. I don't want to move from New York. So, you know, I'm sorry. Um, but then he recontacted me at the end of November twenty thirteen, and I had to go to California for some other reason.
Um, and so he and I had a chat at his house, and, um, and he tried to explain to me what he was trying to do with, uh, AI, that he was putting a lot of, uh, hope in where AI was going, and that it would be really, really useful to Facebook in the long run. Um, he was not at all interested in the use of AI for things like, you know, ad ranking or anything like that. He said, that's kind of boring. Um, you know, we can probably do this with, like, logistic regression or whatever. Um, but he said it's more like, you know, content interpretation and all that stuff. And, you know, he had read my papers. Okay. So I was really impressed. Um, and, uh, and I was kind of, uh, taken by the sort of long term vision that he and Mike Schroepfer, the CTO at the time, had put together. And so I said, well, I only have, you know, three conditions, basically. The first one is, uh, we practice open research, um, because I don't know how to do it otherwise. Uh, you know, I don't know how to hire the best, I don't know how to, um, you know, interact with, you know, universities and academic groups and things like this. So we have to basically publish what we do and things like that. Uh, the second one is, uh, I don't move from New York. Uh, and the third one is I don't quit my job at NYU; I can be part time. And he said yes. So I said, okay, where do I sign? Um, because I was given the opportunity to create a research lab from scratch. There was no research culture at Meta at the time, uh, unlike Google, which already had a research organization, which was organized in ways that I actually didn't like. Um, and so, um, I thought that was kind of a much more exciting opportunity, to basically just create a whole research culture and a research organization from scratch.
Uh, so that became FAIR, uh, which initially meant Facebook AI Research, and then eventually, when Facebook changed its name to Meta, we changed the F to mean fundamental, Fundamental AI Research. Um, and, um, we had a really good run for twelve years, uh, um, in the sense that I think FAIR had an enormous impact on the company. First of all, uh, internally, you know, it basically put the company on the map when it comes to AI research. Um, we had a big impact on the AI research community, um, with things like PyTorch and, you know, FAISS, uh, which is kind of a similarity search, uh, system, and, you know, all the image recognition, uh, open source systems that we put out. And I mean, there's, like, a thousand open source projects that FAIR, uh, produced, which, you know, have had a very, very big impact on the research community, but also on the industry and on the wider world, because a lot of the techniques that were developed at FAIR, you know, have been used by, uh, Facebook, Meta, and other companies for all kinds of stuff, which, you know, people don't realize are there, but are, um, really powering a lot of, uh, systems around the world. So, so really big impact. Uh, PyTorch is probably the biggest of all, because, um, it's, um, you know, the dominant software platform for research in deep learning. Uh, it's by far the one that is the most used nowadays. So, so yeah, it was an amazing, amazing run, uh, at FAIR, with, uh, big impacts. But it's changing now.
Tom Mitchell So, um, what would you say, during that window of time, were the one or two most important developments, technically?
Yann LeCun Okay. So I mean, there was a bunch of things, right, that are a little kind of below the radar. But, um, so the idea that, uh, you can basically write a program in, let's say, Python or whatever language, and basically there is an automated, um, thing behind the scenes that can compute the gradient of the output of this program with respect to all the parameters, uh, inside the program. And it doesn't matter what the program does. It could have loops, it could have tests, you know, uh, you know, all kinds of stuff, right? Um, you basically have completely automatic differentiation. The concept had been around for a while, but the fact that it becomes kind of, uh, you know, a universal tool, uh, that was really kind of generalized by PyTorch, in a way. Uh, and so I think that, you know, has changed the mind of a lot of people in how you do, uh, science, computer science in particular, but, uh, and modeling and all that stuff, right? Um, yeah. So that's a big one. Uh, then the astonishing results in computer vision, like the fact that, you know, you can, um, you know, train basically generic computer vision systems, because, you know, it's economically feasible to collect a lot of labeled data. Uh, you can also have weak labels. So, for example, at Facebook, we used, uh, Instagram photos together with the hashtags that people, you know, write for the photos. And that's a very weak label. It's unreliable, but it turns out to actually be sufficient to train a very good vision pipeline. Um, and so we had amazing results in computer vision, uh, in semantic segmentation, instance segmentation, uh, object recognition, tracking, body pose estimation, you know, all those things, right, which were really difficult to do, you know, just ten years before. Right. Or five years before. Uh, and, you know, generating captions for images and descriptions and stuff like that. Right.
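[Editor's note: the automatic differentiation idea LeCun describes, differentiating an ordinary program with loops and branches, can be made concrete with a tiny reverse-mode sketch. This is an illustration in plain Python, not PyTorch's actual machinery; the `Value` class and the example `program` are made up for this sketch.]

```python
class Value:
    """A scalar that records the operations applied to it (a tiny tape)."""
    def __init__(self, data, parents=(), grad_fns=()):
        self.data = data
        self.parents = parents    # Values this one was computed from
        self.grad_fns = grad_fns  # local derivative of self w.r.t. each parent
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other),
                     (lambda g: g, lambda g: g))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other),
                     (lambda g, o=other: g * o.data,
                      lambda g, s=self: g * s.data))

    def backward(self):
        # Topological sort of the recorded tape, then the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v.parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, fn in zip(v.parents, v.grad_fns):
                p.grad += fn(v.grad)

def program(x):
    # An ordinary program with a branch and a loop: autodiff simply
    # differentiates whatever execution path was actually taken.
    y = x * x if x.data > 0 else x * 3.0
    for _ in range(3):
        y = y * x + 1.0
    return y

x = Value(2.0)
out = program(x)   # here f(x) = x^5 + x^2 + x + 1, so f(2) = 39
out.backward()     # f'(x) = 5x^4 + 2x + 1, so f'(2) = 85
```

[PyTorch generalizes this same tape idea to tensors, rebuilding the graph on every forward pass, which is why arbitrary control flow "just works".]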
So that was kind of, you know, really cool. Um, you know, techniques like Mask R-CNN, for example, that came out of the vision group at Meta were pretty cool. I think, uh, more recently, uh, and I'm jumping ahead a number of years there, uh, what was really astonishing is the fact that you can train those things without any labeled data, basically. So you can train a vision pipeline now, and it produces extremely good, generic representations for images, and you can train it completely unsupervised. You basically take an image, you distort it and you corrupt it in some ways. Okay. You run the original image and the distorted one through identical encoders, and then you train, uh, you know, a neural net in representation space to predict the representation of the full image from the representation of the corrupted or partially transformed one. Um, so that's called joint embedding. Okay. Um, and, uh, perhaps the best examples of this in recent years, to learn representations of images, are DINO, DINO v1, v2, v3, uh, produced by a group at FAIR in, uh, in Paris, and another one is I-JEPA. And, um, and the fact is, you know, you get representations such that if you now feed those representations to a supervised classifier, you get state of the art performance; you get better performance than any supervised system. And that's kind of astonishing. It's fairly recent, it's in the last year. Um. Okay. So now, going back to, you know, twenty fourteen or so, um, the fact that you could do, uh, image recognition at a fine grain, so with a very, very large number of categories. Well, the best example of this is face recognition. So our colleagues at Facebook built a face recognition system that could basically identify a face among millions, if not a billion, actually, with pretty good reliability. Uh, Facebook doesn't do face recognition anymore. The service is turned off, but the technology was there.
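[Editor's note: the corrupt-encode-predict recipe described above can be sketched in a few lines of numpy. This is a forward pass only: random linear maps stand in for the trained encoder and predictor, and all names and shapes are made up for illustration; real methods like DINO also need tricks to avoid representational collapse.]

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W):
    # Stand-in encoder: a linear map down to a small representation.
    return W @ x

x_full = rng.standard_normal(16)   # original input, e.g. a flattened image patch
x_corrupt = x_full.copy()
x_corrupt[8:] = 0.0                # corrupt it: here, zero out half the input

W = rng.standard_normal((4, 16))   # the SAME encoder weights for both views
P = rng.standard_normal((4, 4))    # predictor acting in representation space

z_full = encoder(x_full, W)
z_pred = P @ encoder(x_corrupt, W)

# The loss lives in the abstract representation space, not pixel space:
# training would push z_pred toward z_full rather than reconstruct pixels.
loss = np.mean((z_pred - z_full) ** 2)
```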
And, uh, and the fact that this worked with a convolutional net, like, astonished even me. Right. I mean, I looked at those results and said, are you guys serious? Like, this works so well, I can't believe you can do classification with a million categories. Um, so that was a shock, uh, a bit of a shock to me. Um, and then, um, and then, you know, around twenty fifteen, uh, there was evidence that, uh, first of all, self-supervised learning could be applied to, uh, um, text, essentially representing text, right? So turning a text into a vector to represent the meaning of that text. That can be done completely self-supervised, without any labeling, uh, by essentially taking a text and then removing some of the words, and then training some big neural net to predict the words that are missing. Right. Um, and so things like word2vec and fastText and things like that were built on this idea. And, um, this actually drew on some ideas that, uh, Ronan Collobert and Jason Weston had published in the late two thousands, uh, where they had a paper with, uh, a title that upset a lot of people in natural language processing. It was NLP (Almost) from Scratch. Okay. Because at the time, you know, the NLP community was using, like, handcrafted features, kind of like the computer vision community was using handcrafted features, with, uh, classifiers on top. And they said, oh, you can learn the whole thing. You just train a deep learning system, self-supervised. Take a piece of text, um, and train a neural net to produce a high score. And then substitute or remove the central word or keywords, and then train the neural net to give you a low score. And that seems like a pretty, you know, trivial way of, uh, training a system. But they showed that the kinds of representations that were learned by this, uh, were amazing.
And you had, like, computational properties in those representations, that if you took the vector for, you know, uh, Paris, and you subtract the vector for France, and then you add that to, uh, UK, you get London, right? I mean, so there's, like, you know, algebraic properties of those representations that are kind of pretty astonishing. Um, so the idea that self-supervised learning could really kind of bring something to the table there, I think, was kind of a big, um, um, sort of change of mindset. Uh, and then there was Transformers, of course. Right. Um. So, so before that, there was, uh, some demonstration that, uh, you know, you could basically match the performance of classical systems for tasks like translation, uh, language translation, using large neural nets like LSTMs. So this was the work by Ilya Sutskever when he was at Google. He had this big sequence-to-sequence model with LSTMs, some gigantic model, where you can train it to do, um, uh, translation, and it kind of works at the same level, if not better in some cases, than the classical, um, translation methods. Uh, and then a few months later, Yoshua Bengio, uh, Dzmitry Bahdanau, and Kyunghyun Cho, who is now a colleague at NYU, uh, showed that you could change the architecture and use this attention mechanism, um, that they proposed, uh, to basically get really good performance on translation with much smaller models than what Ilya had been proposing. And the entire industry jumped on this. Uh, Chris Manning's group at Stanford kind of, you know, used that architecture and basically won the WMT competition for a particular, uh, type of translation. And the entire industry jumped on it. So within a few months after that, you know, all the big players, uh, in translation were using attention-type architectures for translation. And that's when, um, the Transformer paper came out: Attention Is All You Need.
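[Editor's note: the Paris minus France plus UK arithmetic can be illustrated directly. The two-dimensional vectors below are made-up toys, constructed so that the capital-of relation is a roughly constant offset; real word2vec embeddings are learned and have hundreds of dimensions.]

```python
import numpy as np

# Hypothetical toy embeddings; "capital-of" is approximately a constant offset.
vecs = {
    "paris":   np.array([1.0, 1.0]),
    "france":  np.array([1.0, 0.0]),
    "london":  np.array([2.0, 1.1]),
    "uk":      np.array([2.0, 0.0]),
    "berlin":  np.array([3.0, 1.05]),
    "germany": np.array([3.0, 0.0]),
}

def nearest(query, exclude):
    # Nearest word by cosine similarity, skipping the words used in the query.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], query))

query = vecs["paris"] - vecs["france"] + vecs["uk"]         # = [2.0, 1.0]
answer = nearest(query, exclude={"paris", "france", "uk"})  # → "london"
```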
So basically, if you build a neural net just with those kinds of attention circuits, uh, you don't need much else, and it ends up working super well. And that's what started the, you know, the Transformer revolution. Uh, and then after that came BERT, that also came out of Google, which was this idea of using self-supervised learning. Right? Take a sequence of words, corrupt it, remove some of the words, and then train this big neural net to reconstruct the words that are missing, predict the words that are missing. Um. And again, people were amazed by, like, how good the representations learned by the system were for all kinds of NLP tasks. And that really, uh, you know, kind of captured the imagination of a lot of people. Um, and then after that, the next revolution was, oh, um, actually, the best thing to do is you remove the encoder, you just use a decoder, um, and you just train a system, you feed it a sequence, and you just train it to reproduce the input sequence on its output. And because the architecture of the decoder is strictly causal, um, because a particular output is not connected to the corresponding input, only to the ones to the left of it, implicitly you're training the system to predict the next word that comes after a sequence of words. That's the GPT architecture that was, you know, promoted by OpenAI. And, uh, that turned out to be more scalable than BERT, in the sense that you can train gigantic networks on enormous amounts of data and you get some sort of emergent, uh, properties. And that's what gave us LLMs. Okay. So, yeah, I mean, a lot of things have happened there. And if you want to summarize this, then, um, you know, a few keywords, right? Self-supervised learning, basically, is the present and the future. Um, and, uh, training by prediction, essentially. Okay. Now, with generative models, you predict the input.
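[Editor's note: the "strictly causal decoder" point comes down to a triangular mask on the attention scores. A minimal numpy sketch; the function name and the uniform scores are illustrative.]

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask out future positions, then softmax each row."""
    T = scores.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)          # a token cannot see ahead
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# With uniform scores, position i attends evenly over positions 0..i and puts
# exactly zero weight on the future -- which is why training the decoder to
# reproduce its (shifted) input implicitly teaches next-word prediction.
w = causal_attention_weights(np.zeros((4, 4)))
```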
Uh, what I'm advocating right now, which we might talk about later, uh, is that you should not do this in many cases. This works for language; it doesn't work for anything else, really. Um, and, uh, and the fact that we now have multiple architectural components that we can combine to build a deep learning system, and so things like convolutional modules and self-attention modules and memory modules and, you know, linear maps and nonlinearities, blah, blah, blah. Um, and one big revelation I forgot, which actually I can't believe I forgot, in twenty fifteen, is the residual network architecture, ResNet. Okay, that was in twenty fifteen, which is a very simple idea, but it solved a huge problem, which is that when you stack multiple layers of a neural net and you train it with backprop, you get some issues: the gradients can become mushy, or they die, or they explode, and you have a hard time training the system. If one of the layers ends up, like, not doing something useful, the gradients basically are not useful, and the entire network kind of dies, more or less. Right. So the idea of ResNets is super simple. You basically have connections that kind of skip every few layers, um, without nonlinearities or weights. Okay. So by default, the entire network looks like the identity function, and what the trainable part of the neural net computes is a deviation from the identity function. Okay. And it turns out that if you do this, you can now stack hundreds of layers, um, or at least dozens of layers. Right. And so that increased the power of, uh, of what you can do with multi-layer neural nets by a huge amount. And that paper by Kaiming He, from when he was back at, um, Microsoft Research, uh, Beijing, uh, is the most cited paper in all of science of the twenty-first century.
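[Editor's note: the skip-connection idea is small enough to show directly. A hypothetical two-layer residual block in numpy, sketched under the assumption of plain matrix weights and a ReLU.]

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the skip path carries x through unchanged, so the
    trainable branch F only has to learn a deviation from the identity."""
    h = np.maximum(0.0, W1 @ x)  # ReLU nonlinearity in the residual branch
    return x + W2 @ h

d = 4
x = np.arange(float(d))

# If the residual branch does nothing (all-zero weights), the block IS the
# identity -- so gradients flow straight through, even in very deep stacks.
y = residual_block(x, np.zeros((d, d)), np.zeros((d, d)))
```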
Tom Mitchell Well, that yeah, that's been an amazing decade of developments and people building on and coming to where we are today. Tell us about your, uh, current thoughts about, um, where the field should go and where you should go next.
Yann LeCun Okay. Well, notice that in my, uh, history of the last dozen years, I did not even mention reinforcement learning. Uh, okay. Which, uh, you know, might sound strange, because there have been enormous successes of reinforcement learning for, you know, game playing, for example. Right. Um, but I've always been very critical of reinforcement learning in its current form, because it's extremely sample inefficient. And so there was, you know, an idea, you know, back in the twenty fourteen, fifteen, or sixteen time frame, uh, you know, prior to AlphaGo and right after it, that reinforcement learning was the ticket to building truly intelligent machines, that they were basically going to be, you know, RL-based, uh, methods, uh, with a slight issue that reinforcement learning is incredibly inefficient in terms of samples. And so I never believed in the primacy, if you want, of reinforcement learning. In fact, in twenty sixteen, I started, uh, in my talks, I have this slide where I represent, uh, machine learning, or intelligence, as a cake. And I say, if intelligence is a cake, the bulk of the cake is self-supervised learning. This is where most of learning happens. It's not reinforcement. It's not task specific. It's just understanding the world. Um, then there is the icing on the cake. That's supervised learning, where you tell the system what output you want. Okay. And you're giving it quite a bit of information. And the cherry on the cake is reinforcement learning. It's so inefficient because the amount of feedback that the system gets from the environment is extremely poor. Informationally, it's very weak. It's just, you know, one number, basically, a reward. And so necessarily the number of trials you're going to have to make for this to work is going to be very large. And so the result is that this will only work in virtual environments or games, where the number of possible actions is relatively small.
And you can combine this with search, tree search, like Monte Carlo tree search. Um, and then you can use reinforcement learning to fine tune the system, have it play, you know, millions of games against itself. And that's really what happened with, you know, uh, chess and Go and, um, you know, all of the games. But it doesn't work in the real world. It's just not efficient enough. Uh, so the few successes that we're seeing in robotics using reinforcement learning are based on simulation, where you train the system in simulation, and then you do kind of a transfer from simulation to the real world. And with a little bit of adjustment this works, right, with imitation learning and things like that. Um, but it's not entirely satisfactory, because it doesn't explain how humans and animals can learn so quickly. Right. So that's the question I've been obsessed with, which is, like, what is it that, you know, lets, uh, a human baby learn intuitive physics in a few months? And, you know, that's human babies; like, animals learn this much faster. Um, and, uh, what is it that, you know, would allow, uh, a young human child to accomplish a task without being trained to accomplish that task? The first time you ask a ten year old to, you know, clear out the dinner table and fill up the dishwasher, they can do it. They don't need to be taught to do it, right? It's something they can just figure out on their own. Um, how is it that, you know, uh, a seventeen year old can learn to drive in about twenty hours of practice, when we still don't have self-driving cars, despite the fact that we have millions of hours of training data of expert drivers driving cars? Right. Um, and so, um, we certainly don't have AI systems that can learn to drive in twenty hours of practice. So what is missing there? That's the question I've been obsessed with for the last, you know, ten years, roughly.
Um, and basically developing an approach to that, I've converged on this idea of a world model. Right. Which is not a new idea; like, people have been using the concept of a world model for decades in optimal control, for example. So it's the idea that if you have some idea of the state of the world, and if you have an action that you imagine taking, can you predict the outcome of that action on the state of the world? Right. So that's a world model, a model of the environment that the system operates in. And so if you have this model, from the current state of the world, you can have the system predict what the result, or the outcome, of taking a sequence of actions will be. You can compare this with a goal, um, which would characterize whether a task has been accomplished. And then, through planning, you can figure out a sequence of actions that will minimize that cost function, that distance of the predicted final state to the goal. Okay. This is how planning has worked in, you know, optimal control and in AI for seven decades. Um, so I think that idea is very powerful and was somewhat obscured or ignored by the reinforcement learning community: the fact that you don't need to learn a policy, you need to learn a world model, and then you can use this model for planning. And that, to some extent, is kind of going back to ideas in classical AI, because in classical AI, planning and search is basically, you know, a super central concept, right? There's a lot of work on this in classical AI, and certainly in optimal control, in motion planning in robotics, and things like that. So I'm saying essentially, use this concept of model predictive control, MPC, um, but with a model that is learned from data, and possibly a hierarchical model, so you can do hierarchical planning. That's basically the program I've been pursuing for the last ten years, uh, myself, at NYU and at Meta.
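[Editor's note: the predict-compare-plan loop described above can be sketched as random-shooting model predictive control. Everything here is a toy: a hand-written one-dimensional dynamics function stands in for the learned world model, and distance to the goal is the cost.]

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    # Stand-in for a LEARNED model: predicts the next state given an
    # imagined action (here, a 1-D point moved by the action).
    return state + action

def plan(state, goal, horizon=5, n_candidates=200):
    """Random-shooting MPC: sample action sequences, roll each through the
    model, keep the one whose predicted final state is closest to the goal."""
    best_seq, best_cost = None, np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        s = state
        for a in seq:
            s = world_model(s, a)  # imagine taking the action
        cost = abs(s - goal)       # distance-to-goal objective
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost

seq, cost = plan(state=0.0, goal=3.0)
```

[In practice the model is a trained network, the cost can encode constraints, and typically only the first action of the best sequence is executed before replanning.]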
And we've been making some really fast progress over the last three years. So I'm really happy about this.
Tom Mitchell I agree, that's a very interesting direction. We could have a long conversation, but we're near the end of our time here. Let me just ask one question. It's always seemed to me one of the interesting questions about world model learning has to do with the clock, the time. How do you deal with time? So if I have something in my hand and I take the action of letting go of it, um, how do you describe the result of that? What is it a millisecond later? Is it when the world stabilizes and reaches steady state? Um, what's your quick take on that question?
Yann LeCun So my quick take on this is that the solution to this question is in hierarchy. And again, it's the same concept as deep learning in the first place. So we can make predictions. As humans, we can make predictions about the state of the world that are extremely accurate, but only if they are short term. Right? So if I throw a ball at you, you can predict where the ball is going next and position yourself to grab it or whatever. This is relatively short term, but it's relatively accurate. Um, to make long-term predictions, you cannot be that accurate, right? And so you have to have a representation of the state of the world that is more abstract. The longer-term a prediction you're trying to make, the more abstract the representation within which you make the prediction needs to be. And that suggests a whole architectural concept that I've been pushing for a number of years, called JEPA. What that means is joint embedding predictive architecture. It's this idea that if you have a system that observes sensor data, be it video or other sensor data, high-dimensional, continuous, possibly noisy, and you want to make predictions, you cannot possibly predict every detail of the data. So the idea that you're going to use a generative model, of the type people use for language, to predict what's going to happen in a video, for example, is nuts. And it doesn't work; we've tried this many times with many different techniques. Um, but it's also useless, because you don't need to predict every detail of what goes on in the video. What you need to do is retain as much information as possible in some abstract representation space, so that you characterize as much as you can of what goes on, but in such a way that you can actually predict what goes on. Right.
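[Editor's note: the core structural idea of a joint embedding predictive architecture, predict in representation space rather than in pixel space, can be sketched minimally. Everything below is a hypothetical toy: the encoders and predictor are plain linear maps, whereas real JEPA models use deep networks and keep the target encoder fixed or slowly updated to prevent collapse.]

```python
def encode(x, w):
    # toy linear encoder: maps an input vector to an abstract representation
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def predict(z, p):
    # the predictor operates entirely in representation space
    return [sum(pi * zi for pi, zi in zip(row, z)) for row in p]

def jepa_loss(x, y, w_ctx, w_tgt, p):
    """Predict the *representation* of the target y from the representation
    of the context x. The loss is measured in embedding space, never in
    pixel space, so unpredictable detail need not be modeled at all."""
    z_pred = predict(encode(x, w_ctx), p)
    z_tgt = encode(y, w_tgt)   # in practice: stop-gradient / EMA target encoder
    return sum((a - b) ** 2 for a, b in zip(z_pred, z_tgt))

# With identity encoders and predictor, a target equal to the context
# is predicted perfectly; any other target incurs a positive loss.
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_loss = jepa_loss([0.3, -0.2], [0.3, -0.2], identity, identity, identity)
```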
So there's a lot of things like this. Imagine you're driving on a road in the countryside and there are trees around, right? Everything you care about is basically what the other cars are doing, whether there's a pedestrian who's going to cross the street, and various other things like this, maybe interesting things in the landscape. But there's going to be random motion of the leaves on the trees because there is wind, and maybe there is a pond behind the trees with ripples on it, and there's no way you can predict this at the pixel level, and there's no way you want to devote any resources to actually doing this. Right. Um, so what you need to do is figure out an abstract representation of the world, or of the state of the world, within which you can make predictions. And we do this absolutely all the time in science. Okay. I mean, I think it's at the root of intelligence, really. But we do this all the time in science. Like, I could describe everything that takes place between us right now in terms of quantum field theory or something. Okay. We would need to know the wave function of the universe that contains New Jersey and Pittsburgh, and have some gigantically large quantum computer, to be able to make that prediction. So of course it's completely impractical. What do we do? We invent abstractions. Right? Particles, atoms, molecules; in the living world it would be proteins, organelles, cells, organs, organisms, ecosystems, societies. Right. And all of those levels of description, as you go up the ladder, they're more and more abstract. There is less and less detail about the underlying dynamics.
But that is precisely what allows you to make predictions. Like, there is some description of what is taking place between us right now in terms of human psychology, which in principle can be reduced to quantum field theory. But we don't do this; that's crazy, right? Uh, and if we were to use quantum field theory, we would only be able to make predictions that are extremely short term, whereas with psychology, maybe we can have some prediction about our state of mind tomorrow or something. Right. So this idea that you have to learn abstract representations to make predictions is absolutely fundamental. And that's the idea behind this architecture: learn a representation, make predictions through representation space. And what that also allows you to do is learn world models. Because now that you have a representation for the state of the world, and a way to predict what the next state of the world is going to be, perhaps you can condition this on an action that you imagine taking, or an intervention you're making on the environment, on the system that you are considering. So now you have a causal model, right? State of the world, action, next state of the world. Now I can do planning. Okay. So if I can train those architectures, action-conditioned, and I can come up with some planning procedure, and I can train a hierarchical version of this that builds more and more abstract representations of the world, then I can do hierarchical planning. So that's basically my research program for the next, uh, you know, until my brain turns to mush.
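[Editor's note: the action-conditioned, causal structure described here, representation plus action yields the next representation, can be sketched as a latent rollout. This is a hypothetical one-dimensional toy with hand-written encoder and dynamics, meant only to show the shape of the computation, not a real learned model.]

```python
def rollout(encoder, dynamics, x0, actions):
    """Action-conditioned latent world model: encode the observation once,
    then predict forward purely in representation space, one causal step
    (state representation, action) -> next representation at a time."""
    z = encoder(x0)
    trajectory = [z]
    for a in actions:
        z = dynamics(z, a)
        trajectory.append(z)
    return trajectory

# Toy 1-D illustration: the representation is a position, the action a displacement.
encoder = lambda x: float(x)
dynamics = lambda z, a: z + a
traj = rollout(encoder, dynamics, 0.0, [1.0, -0.5, 2.0])
```

A hierarchical version would stack several such models, with the higher levels running coarser dynamics over more abstract representations and longer time steps.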
Tom Mitchell Okay. Well, looking forward to some of that. It's been great to have a chance to hear both your history and how you got to this point, and also what you think are some of the interesting things to be working on today. Yann LeCun, thank you so much for sharing with us.
Yann LeCun Uh, thank you, Tom. That was a real pleasure.