In this post, I'm going to comment on the ["The Scientist AI: Safe by Design, by Not Desiring"](https://lawzero.org/sites/default/files/publications/81/safetyargumentblogpost_2.pdf) paper, specifically on this part:

> Consequence invariance. Our proposed approach is to sever any training signal that would inform the estimator about the downstream consequences of its predictions. If the training procedure cannot evaluate whether a given estimate led to favorable or unfavorable outcomes in the world, then it cannot learn to bias its outputs toward achieving preferred outcomes. We call this property consequence invariance: the learned estimator, and each update to it, must be invariant to any choice of utility function over downstream effects of predictions. In practice, this rules out training regimes akin to standard reinforcement learning, where a model interacts with an environment and receives feedback in service of an arbitrarily chosen utility that would depend on the effects of estimates. It also rules out objectives that optimize for anticipated future performance, as in model-based planning. In other words, the training signal must be myopic, concerned only with making an accurate prediction for the given query, given the data at hand.
>
> A subtlety arises, however. A good estimator that models the world must be able to understand the consequences of actions, including the consequences of its own estimates, should those estimates influence the world. This is the problem of performative prediction: the act of forecasting can change the actual variables being forecast. If the estimator ignores this problem, it risks systematic error; if it does take performative prediction into account, then it may have an incentive to impact the world: it could select its outputs to bring about states of the world that make its estimates come true. One resolution to this problem draws from existing work on oracles [2], and the heart of it lies in evaluating counterfactuals.
> During training, we only ever pose questions in the spirit of: “I know you may give a different answer, but what would be the probability of event Y if you were to output q as the answer to this query?”, for different values of q. Ensuring accurate evaluation of these counterfactuals remains an open problem, but multiple such answers can sever the training signal about the future outcome of the world, thereby eliminating this potential expression of goals and any corresponding training signal. In the next section, we fill in some details about causal modeling, as is needed to make sense of counter-factual queries.

# How I understand it

LawZero wants to train a question-answering model whose answer is a number stating the probability that a given statement is true (i.e. the question is posed as a statement). During training, they avoid using training samples where the answer depends on the prediction itself.

# How it can go wrong

The training data doesn't contain any examples where the truth depends on the model's own answer, so when the model is asked a question where the truth does depend on its answer (I call such situations "self-referential situations"), the output is quite unpredictable. This is because the model's answer will depend on what the model believes about what it will output, and the training data doesn't contain any information about how the model behaves in such situations.

## Mistakes in training data

Sometimes, mistakes in the training data can make the model learn an unintended function. For example, when we train a supervised learning model, that model represents some function that maps inputs to outputs (labels). In the case of the question-answering model, we want the function to be "for a given question, give me the true answer to that question". However, let's suppose that there is a human mistake in the training data. As a result of such a mistake, the learned function can be: "for a given question, give me the answer that a human would give".
The training method can produce a model that represents that function because the training data is prepared by humans, and therefore the second function predicts the labels in the training dataset more accurately. There is a small difference between "what is true" and "what humans confidently believe is true".

If the model is trained once and then used without ever being trained again, then maybe it's not a big problem. But if new training samples are added over time and the model has learned the function "the answer is what humans believe is true", then the answer that counts as correct might depend on the answer the model gives. If answer Y is not true, but giving it manipulates human judgement in such a way that humans come to consider it true, then the model might give answer Y.

# What's the solution

I think it would be better if, instead of avoiding such self-referential situations in the training data, we intentionally included them in order to demonstrate how the model should handle them (it should handle them by making a prediction that is good for the user). That way, we would have confidence that whenever the model is asked a self-referential question, the answer will be safe.

However, I think that if we train it while avoiding self-referential situations, the probability that the model develops goals would still be very low. How it would answer self-referential questions would be unpredictable, but it's unlikely that it would consistently follow some goals, because the set of possible ways in which it could behave is very large, and the set of behaviors in which it pursues some goals is a very small subset of those.

When I started writing this post, I thought this problem was more serious than it appears to me right now. But I reconsidered it while writing, and I now believe that avoiding training on self-referential situations would also play out well, although I still believe that the proposed alternative is safer.
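
To make the notion of a self-referential question more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration and does not come from the paper: the function `prob_event_given_announcement` is a hypothetical world model that returns the probability of event Y if the model were to announce `q`, and the numbers 0.1, 0.5 and 0.9 are arbitrary. The sketch searches a grid for announcements that would be accurate once announced.

```python
def prob_event_given_announcement(q: float) -> float:
    """Hypothetical world model: P(Y | the model announces probability q).

    Toy assumption: people act on the announcement, so Y becomes likely
    only if the announced probability is already at least 0.5.
    """
    return 0.9 if q >= 0.5 else 0.1


def self_consistent_answers(steps: int = 100) -> list[float]:
    """Return the grid values q for which announcing q makes q accurate."""
    grid = [i / steps for i in range(steps + 1)]
    return [
        q for q in grid
        if abs(prob_event_given_announcement(q) - q) < 1e-9
    ]


if __name__ == "__main__":
    # Prints every announcement on the grid that would turn out accurate.
    print(self_consistent_answers())
```

In this toy world, two different announcements are both accurate after the fact, so accuracy alone does not tell the model which one to give. Picking one of them is a choice about the state of the world, and that choice is exactly what training data without self-referential situations leaves unspecified.
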
# Conclusion

Mainly, I would like the ScientistAI team to take the following two things into account:

1. There are questions for which it's impossible to give an answer without taking a side (i.e. a non-agentic answer): some questions require an answer that influences the answer to the question. If you don't train the model how to handle such situations, then the outcome for such questions is uncertain (at least to me, at least partially).
2. I think it's not safe to assume that it's possible to prepare a training dataset that doesn't contain any mistakes. Those mistakes might result in the model generalizing in unintended directions.

%%
# How can that go wrong

It can go wrong as follows. From what I understand, the training method used is supervised learning, using some learning algorithm that produces non-agentic models.

When we train a supervised learning model, that model represents some function that maps inputs to outputs (labels). In the case of the question-answering model, we want the function to be "for a given question, give me the true answer to that question". However, let's suppose that there is a human mistake in the training data. As a result of such a mistake, the learned function can be: "for a given question, give me the answer that a human would give". The training method can produce a model that represents that function because the training data is prepared by humans, and therefore the second function predicts the labels in the training dataset more accurately. There is a small difference between "what is true" and "what humans confidently feel is true".

maybe talk about RL here

If the training dataset is perfectly correct, for example because it contains only mathematically proven statements, then the first function also fits perfectly. But, for the model to be useful, we might need a large dataset, containing questions that are unrelated to math.

Now, if we assume that the learned function is "for a given question, give me the answer that a human would give", then the following problem arises. Let's suppose that the model is smart enough that is capable of
%%