Martian - Theoretical Explorer

This project develops safe, high-performing AI agents that are not based on reinforcement learning. The problem it addresses is AI alignment in general. Solving that would also address interpretability: if you have an aligned agent and you want to know why it is doing something, you can just ask it. If it is capable enough, it can also be asked to interpret its own internals (e.g. the weights of its neural network) or to automate interpretability research.

My proposed approach is to build a safe, aligned and high-performing agent that does not use reinforcement learning. Instead, it uses a non-agentic question-answering model to predict which action the human operator will consider to have been the right one in hindsight (looking back from the future), and then takes that action programmatically.

This is expected to be safer than reinforcement learning. Reinforcement learning takes the action that maximizes the reward, and the reward is a measurement of what we want rather than exactly what we want, so it is an imperfect proxy for the value of what we want. The action that we will consider best in hindsight is, likewise, an imperfect proxy for the best action. The difference is in how each one fails. If reinforcement learning goes wrong, the agent actively maximizes the wrong goal (the measurement of what we want, instead of what we want), which results in a conflict of interest between the agent and the human. If the prediction of the action we will consider right in hindsight goes wrong, we just take a suboptimal action, without the agent being in conflict with us.

It's important that the training method we use produces a non-agentic model. I make an important distinction between a non-agentic prediction model and an agentic prediction model. The first one achieves the goal of minimizing the prediction error solely by making the most likely prediction.
The second one can also minimize the prediction error by influencing the world so that its prediction turns out to be correct.

Reinforcement learning does have advantages of its own, such as being able to learn from experience and being able to learn exactly what is needed. But it's possible to achieve those with non-agentic test-time learning. I have an idea of how to do that (I don't have space to describe it within 500 words), but I don't know all the details yet, so figuring them out would be part of the project.

To measure success, we need to measure two things:

1. Performance: by comparing performance with other models and algorithms.
2. Alignment: whether a model is non-agentic can be tested empirically. It can then be theoretically justified that if the underlying model is non-agentic, the agent will behave in an aligned way.

The expected output is code for training an aligned artificial intelligence and, if I receive sufficient funds for computational power, an aligned AI agent itself. The impact is solving AI alignment.

%% Project description for the Martian Interpretability prize. %%
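
The agent loop described above (a non-agentic predictor plus a programmatic action step) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and not part of the proposal: the `Predictor` interface, the `HindsightAgent` class, the wording of the question, the idea of scoring a fixed list of candidate actions, and the `toy_predictor` stub that stands in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical interface: maps a natural-language question to the predicted
# probability that the answer is "yes". In the proposal this role is played
# by a non-agentic question-answering model; here it is just a callable.
Predictor = Callable[[str], float]


@dataclass
class HindsightAgent:
    predictor: Predictor

    def choose(self, situation: str, candidates: Sequence[str]) -> str:
        """Pick the candidate the operator is predicted to endorse in hindsight."""
        if not candidates:
            raise ValueError("at least one candidate action is required")

        def score(action: str) -> float:
            question = (
                f"Situation: {situation}\n"
                f"Action: {action}\n"
                "Looking back from the future, will the operator consider "
                "this to have been the right action?"
            )
            return self.predictor(question)

        # The agent only reads predictions; it is never trained on a reward.
        return max(candidates, key=score)

    def act(
        self,
        situation: str,
        candidates: Sequence[str],
        execute: Callable[[str], None],
    ) -> str:
        """Choose an action, then carry it out programmatically."""
        action = self.choose(situation, candidates)
        execute(action)
        return action


def toy_predictor(question: str) -> float:
    """Stand-in for a trained model: a fixed lookup, for illustration only."""
    if "ask the operator to confirm" in question:
        return 0.9
    if "delete the files" in question:
        return 0.2
    return 0.5


if __name__ == "__main__":
    agent = HindsightAgent(predictor=toy_predictor)
    executed = []
    agent.act(
        "The disk is almost full.",
        [
            "delete the files immediately",
            "ask the operator to confirm first",
            "do nothing",
        ],
        executed.append,
    )
    print(executed)
```

In this sketch the selection step is ordinary code wrapped around a prediction, so a wrong prediction leads to a suboptimal choice rather than to an optimizer pursuing a mis-specified reward. How the predictor is trained so that it stays non-agentic is the open part of the project and is not shown here.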