
What You've Heard About Q* is Bull**** - It's Not AGI

by John Ash

https://www.youtube.com/watch?v=KsLlF9MXK54

Transcript:
(00:00) AGI has not been cracked. This isn't really step-by-step reasoning, and this isn't planning. Ultimately you're still feeding a context window of characters to the model; there is no step. They use special characters to create a notion that there is a break, and it learns that that special break character means there is a next step, in the same way that a space or a period indicates a semantic break. But it's not that there is any real sense of time. It's still doing the same thing that LLMs are already doing, which
(00:40) is token by token by token outputting things, and it produces symbolic representations that are meaningful to us. I promise you: Q* is a variation of reinforcement learning with human feedback, and that process doesn't even play out in the end product. It only occurs during training. So when you deploy the model in its final state, Q* isn't even running; that's just the process by which you update the parameters to get to the model that you want, that does the things
(01:24) that you want. But it's just sampling from the collection of human knowledge, and math is a very structured thing; everything it needs is already in the model, and it's just sorting through the noise. There are a number of leaks regarding something called Q*, and these leaks have taken root because there was a lot of drama at OpenAI, and a lot of people are very uncertain as to why that happened. They really want to know why, so they have been zooming in on the company, trying to understand what's been going on. There have been two leaks:
(02:02) one which is reasonable, and one which does not deserve amplification. The first is the basic claim that they were training on some algorithm called Q*, that it was working on solving math problems, and that they got to 100% on the math test. This is exciting, because they were struggling a lot on math benchmarks: (02:38) GPT-3.5 supposedly only got about 40%; in May they released a paper that got to about 78%; and with this approach they got to about 100%. But that does not mean that training on this task has enabled it to do things outside the domain of what humans have directed it to do. It does not mean that it's creating new math. It does not mean that it has access to an infinite amount of compute to keep processing when there is uncertainty about a problem. It does not mean it has the capacity to self-direct. It is still
(03:20) the same foundational paradigm of how large language models work: you take a lot of data, you do an initial pre-training process where you just predict the next token, and that takes a very, very long time; then you do a second round of fine-tuning, where you refine the parameters based off of some type of human feedback to get it to align more with our expectations of how it should behave and function, to make it play nice. Okay, so as I was saying, the thing that is very
(04:03) likely true is what is in alignment with what they have already published publicly. In May of this year they published a paper called "Let's Verify Step by Step," about something called process reward models. Essentially, instead of going from math problem to answer right away, we show our work: we go step by step by step. This helps for humans too; it's basically the machine learning equivalent of saying "sound it out." When you have a word that you can't pronounce as a child,
(04:42) you say: sound it out, break it up into parts. And this shouldn't be surprising if you work with ChatGPT: if you have wrong answers in the context window, it starts to do worse. A lot of people ask me how I get such quality answers out of GPT-4 when it's not working for them, and I just say: I delete the wrong answers. Because if you have a long stream of it getting it wrong, what it starts to do is model a conversation with somebody who doesn't have the right answers. Whereas if you have a long stream of it getting it
(05:20) right, and it's only generating a small next part, it's going to do a lot better. And they have been working for a while to improve the capacity of GPT to do math. So let's scroll back and talk about how they have been training GPT-4 to do all the things that you're familiar with, how they have been re-approaching this problem to train for math, and how those problems differ. Okay, so the primary difference between language and
(06:10) math is that math usually has just one very specific answer; there is not a lot of room for semantic vagueness. When we're talking about alignment, or about textual outputs, there are a lot of things that can satisfy the people consuming the output. But with math there is one right answer, and very little room for error. Okay, so when we talk about how these models are refined: we start with this big pre-training, where we're just predicting the next token, and by predicting the next token we're taking
(06:53) all of human knowledge, all of the noise and all of the signal, all of the contradictions and conflict within collective human knowledge, all of the human shadow and all of the human light, putting it in one place, and simply trying to predict the next token. That does not get us close enough to a functional tool we can use. So what they have done is a second round of fine-tuning. Part of it involves human-curated datasets, where they continue predicting the next token with a
(07:38) focus on "we expect it to behave this way." But the big breakthrough has been something called reinforcement learning with human feedback. The way it works is that you have humans rate, in order of quality, a number of outputs from the model: you have a single prompt, the model gives multiple responses, and people say which is best. Then, in order to train on that rapidly, they train something called a reward
(08:15) model. The reward model is trained to predict how people will rate responses, so it can be used very rapidly as a proxy for human feedback: the model produces many different answers, the reward model predicts whether humans would call each a good or bad response, and the model updates its parameters in relationship to that reward signal. That works very well on these semantic things. It does not seem to work so well on math, and it also doesn't seem to work very well on
(09:00) programming, because, you know, GPT-4 is a dog of a programmer, and it drives me crazy; it just doesn't work very well. So what they did in "Let's Verify Step by Step," released in May, is have the model produce multiple steps of reasoning and score each step in the process by its quality. That is done in a similar way: they train a reward model on human rankings of the quality of each step, and that then serves as a proxy for the
(09:49) quality of each step. So when the model is trying to get from point A, the math problem, to point B, the math answer, through these multiple steps, if the reward model ever says "oh, this is a very bad step," it can stop. You can basically traverse step by step until you hit what you think is a bad answer. This got them from 40% with GPT-3 to about 78% on these math benchmarks, and the claim now is that Q* is getting them to 100%. I think this is probably true.
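A minimal sketch of that traverse-and-prune idea. Everything here is an assumption for illustration: the real process reward model is a fine-tuned LLM and the real candidate steps are sampled from a generator, neither of which is public, so both are stubbed out with toy stand-ins.

```python
import heapq

# Toy "generator": candidate next steps at each depth of a solution
# to 2 + 3 * 4 (a real system would sample these from an LLM).
CANDIDATES = [
    ["2 + 3 = 5", "3 * 4 = 12"],    # first-step options
    ["5 * 4 = 20", "2 + 12 = 14"],  # second-step options
]

# Stand-in for the process reward model: score a single step.
# Here we simply hard-code which steps respect operator precedence.
def prm_score(step: str) -> float:
    return {"3 * 4 = 12": 0.9, "2 + 12 = 14": 0.9}.get(step, 0.1)

def solve(threshold: float = 0.5):
    """Best-first search over reasoning steps, abandoning any branch
    whose latest step the PRM scores below `threshold`."""
    # Heap entries: (negated cumulative score, path of steps so far).
    frontier = [(0.0, [])]
    while frontier:
        neg_score, path = heapq.heappop(frontier)
        if len(path) == len(CANDIDATES):
            return path  # complete, highest-scoring solution
        for step in CANDIDATES[len(path)]:
            s = prm_score(step)
            if s < threshold:
                continue  # the PRM flagged a bad step: stop this branch
            heapq.heappush(frontier, (neg_score - s, path + [step]))
    return None

print(solve())  # → ['3 * 4 = 12', '2 + 12 = 14']
```

The point of the sketch is only the control flow: the reward model prunes branches mid-solution instead of waiting for a final answer to grade.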
(10:39) But I do not think that because it is now able to do math better, it is suddenly solving anything out of domain, or that this generalizes to tasks that have never been done before. In particular, within the math domain you have a very clear success signal, and that's very important when you are trying to train a model. If you look at something like AlphaGo: it's a game, there's a win state, so you can take the humans out of the situation and you can have self-play. You can have the model play against itself many, many,
(11:22) many games, and because there's an ultimate signal that says "you have succeeded," it can just keep going. But language is not really like that. Language is a reflection of the real world, which the model does not have access to. It is trying to create a representation of the underlying belief space, a sort of world model of belief, that helps it efficiently predict the next token. And by all means, there is more than enough knowledge within the training set for it to do math up to 100%. It's just that, you know,
(12:07) math is hard. There are a lot of wrong answers out there, so if you just predict the next token, you're also modeling everybody who does bad math, everybody who got it wrong. But there's more than enough information to traverse through that semantic space, from the point where you have the problem to the point where you have the correct answer, if you can verify step by step by step, and improve the process of getting from point A within this large semantic space, which is the
(12:44) original model, to the end point, which is the correct answer. To do that you have to do fine-tuning, usually with a reinforcement learning model. Now, what I'll say is that there is, let's see if I can bring this up, a paper released in March of 2023 called Q*, which combines Q-learning (deep Q-networks) with A* search and involves a similar process. But one of the things I really want to point out is that this isn't really
(13:30) step-by-step reasoning; this isn't planning. Because ultimately you're still feeding a context window of characters to the model. There is no step in reality, no breakage point that says: this is one message, this is the next; this is one step, this is the next. They use special characters to create a notion that there is a break, and it learns that that special break character means there is a next step, in the same way that a space or a period indicates a semantic break. But it's not that there is
(14:08) any real sense of time. In the end, when you train this large language model with Q*, if it exists, which I think it probably does, it's still doing the same thing LLMs already do: token by token by token, outputting things. It produces symbolic representations that are meaningful to us, reflections of our own understanding of the world. And because we make a lot of judgments about intelligence from this symbolic knowledge, we see something in it that represents some true understanding. But
(14:58) that's not artificial general intelligence at all. That's just tracing a pathway between a starting point and an ending point, based off of the collection of human knowledge that has already been established. It's not doing something like stepping far outside the space of reasoning to find a brand new type of math. It's not operating where there is no clear target. When you or I address the world, there is no inherent signal that says "okay, you are succeeding" or "you are
(15:40) failing." You basically feel it out, you vibe it out; your attention flows different ways, and you have to decide whether you are successful or not based off of that. Like in my process of reviewing this: there is some general intelligence that I'm using. I have to keep going; there's no particular end point of my processing until I feel a sense that I have plugged all the pieces together, that when I flow through it every single time there is no error. There's no particular point at which I could say "oh, this is the
(16:24) termination point." No, I just kept going, day by day by day. I kept thinking about it, feeling like maybe I've got it, maybe I don't. People kept posting new things that I would watch or read, and I would think: oh my god, they just do not care in the same way that I do about this being true. They just want to get it out as soon as possible; they want to say whatever people are speculating, and as long as they add a little asterisk that says "well, take this all with a grain of salt," then it's fine to talk about, it's fine to put
(16:53) out misinformation into the world. So, again: does solving math problems mean that we have AGI? Well, think about it this way. What they've basically done is reinvent a calculator in symbolic language. A lot of the math problems they are solving could already be done, with far less energy, by plain calculation. We just have this expectation that computers do math well, so when large language models couldn't do math well, we thought: okay, that's weird. And now they seem to be able to do math well because they have
(17:30) this different process of training. It does seem to be multi-step reasoning, where there is something like a search graph, with an evaluation of the value of each particular step down the path. It does seem like we're trying to reach a shorter pathway between those steps. But ultimately we're able to do that because we can explicitly say the correct answer, and because math has such a structured nature that if you have the rules, you can always trace through to the answer. When we're talking about
(18:19) inventing brand new things, or looking at a situation in a very generic sense, thinking about time and our relationship to time and to uncertainty, it is not as clear. Large language models still do not have any realistic relationship with time. They are not embedded in a timeline. They are not enabled to have continuous access to keep computing. At their final step, they are still just acting in response to a human prompt. It's always initialized by something a human has said; it never just goes off on its own and
(19:12) continues forever. Even if you set up some sort of loop that does that, it's because you have set a target in natural text: you're saying, this is where you should be going, this is what you should continue to be operating on. It's not operating in relationship to the world. It's not doing real planning. This is about as much planning as you writing out steps in GPT, saying "can you make a plan for me?" That's about as much as it is. So, to
(19:56) conclude: we have not gotten to AGI. We have not cracked encryption. We have not seen a model take initiative to do all sorts of things we did not request. The model is not self-directing in any way. The foundational architecture of large language models remains the same: the ultimate product you're using is, you provide it with a prompt and it responds to you. To achieve that, there is a first round of training, which involves predicting the next token, and then a second round of training that
(20:40) involves human feedback in some way. And they realized that reinforcement learning with human feedback, where you are rating whole answers in order of quality, does not make as much sense for math problems, because there's really only one answer. So they take the value of that and apply it to the steps in between, where human labelers rate the quality of each step the model takes on the way to the end point. And so it's able to generate all of these potential pathways through semantic space, through
(21:26) the model, to get to the end state. This is the outcome-supervised reward model versus the process-supervised reward model: with the outcome-supervised model you only give feedback on whether it got the right answer; with the process-supervised model you give feedback on each and every step. This reads very, very closely to the notion of Q*. And there's another paper, called "Dar" or something like it, a multi-step reasoning thing that seems to be in the same vein of the same problem: we're going step by step.
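The outcome-supervised versus process-supervised distinction can be sketched in a few lines; the trace and labels below are invented for illustration, not taken from the paper:

```python
# Toy reasoning trace: step 1 is wrong, but imagine the final answer
# happens to be marked correct anyway (e.g. a lucky cancellation).
trace = ["2 + 3 = 6", "6 * 4 = 24"]    # hypothetical model output
final_answer_correct = True            # outcome label: one bit per trace
step_labels = [0, 1]                   # process labels: one bit per step

# Outcome-supervised reward: every step inherits the single
# right/wrong verdict on the final answer.
def outcome_rewards(trace, correct):
    return [1.0 if correct else 0.0] * len(trace)

# Process-supervised reward: each step gets its own human verdict.
def process_rewards(trace, labels):
    return [float(l) for l in labels]

print(outcome_rewards(trace, final_answer_correct))  # [1.0, 1.0]
print(process_rewards(trace, step_labels))           # [0.0, 1.0]
```

Note how outcome supervision credits the faulty first step, because only the end result is checked, while process supervision localizes the error; that difference in credit assignment is the whole argument for supervising each step.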
(22:08) But again, it's really just the notion that if you have a large context of what the answers already are, then the next step is not that hard. It takes many reasoning steps to get from the problem to the answer, but in the end, ultimately, what we're doing is showing the work. And because we're showing the work, people can evaluate and know how it's going through the process; it's providing interpretability to the process by which the goal is achieved. But it doesn't mean that there is some sort of
(22:48) understanding. It doesn't mean there's some self-directed capacity, and it doesn't mean that you need to be afraid, and it doesn't mean that humanity is in danger for the reason people say. I still maintain that humanity is in danger, and it's for none of the reasons people are saying. You are not in danger because it's going to suddenly become self-aware and kill us all. You're in danger because we evolved to perceive a sense of the other through very simple mechanisms, and we are hacking those mechanisms.
(23:29) We are hacking our sense of reality itself. We are creating generative representations that are confusing to us, where we can't really tell, or understand without a lot of knowledge, why it's doing this. I'll add a little bit of a philosophical aside, which is that I think the reason people have a problem with this is that they identify so closely with their thoughts. They view themselves, the product of their own intelligence, as their thoughts. They don't view their thoughts as something that arises from their subconscious
(24:10) through some neural mechanism, to which they, as a higher integrated intelligence (and we don't fully understand how that happens yet), are responding: the interface of thought. What we seem to have invented with all these generative models is subconscious thought, the part we don't identify with, the part we call our brain. And there is a way to detach yourself from your own thoughts so that you no longer identify with them as much. Mindfulness meditation is very valuable,
(24:52) for example, if your mind is very self-critical of itself, if your thoughts tell you you're stupid or dumb, or if you have OCD; then you can change your own relationship to the intelligence that is fed forward to you by the underlying apparatus. But the fact that we have recreated a sense of subconscious thought through the artifacts of our thoughts over time does not mean that we've recreated consciousness. It does not mean that there is an identity within the model that
(25:36) is placing itself in existence in relationship to time. It doesn't mean that it wants to protect itself. It doesn't mean that it's embedded in some sort of substrate and has its own motivations and its own goals. All of the goals are still from us. Okay, so you can stop freaking out. I promise you that Q* is just a variation of these process reward models; it's a variation of reinforcement learning with human feedback, and that process doesn't even play out in the end product. It is only in the process of training that
(26:17) this even occurs. So when you deploy the model in its final state, Q* isn't even running; that's just the process by which you update the parameters to get the model that you want, that does the things that you want. It's just sampling from the collection of human knowledge. Math is a very structured thing, and all that it needs is already in the model; it's just sorting through the noise. It's just sounding out the word. It's just showing its work. And that's cool, and that's a leap forward, but it's not
(27:01) AGI. And it is unlikely that they have seen something incredibly terrifying within the company that they're not talking about or showing. It is more likely that they are just not communicating well, that there are factions within OpenAI who can't see eye to eye and can't integrate those disparate perceptions, and that their governance structure is very outdated. So thank you for listening. I hope this all made sense to you, and I hope you have a beautiful day.