Martín Soto

Mathematical Logic grad student, doing AI Safety research for ethical reasons.

Working on conceptual alignment, decision theory, cooperative AI and cause prioritization.

My webpage.

Leave me anonymous feedback.

Sequences

Counterfactuals and Updatelessness
Quantitative cruxes and evidence in Alignment

Comments

Claude learns across different chats. What does this mean?

 I was asking Claude 3 Sonnet "what is a PPU" in the context of this thread. For that purpose, I pasted part of the thread.

Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.

I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).

This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.

In fact, even when copying the exact same original prompt (which had elicited Claude to take OA to be Anthropic) into a new chat, the mistake no longer happened: not when I retried many times, nor when I tried the same thing in many different new chats.

Does this mean Claude somehow learns across different chats (inside the same user account)?
If so, this might not happen through a process as naive as "append previous chats at the start of the prompt, with a certain indicator that they are different", but instead through some more effective distillation of the important information from those chats.
Do we have any information on whether and how this happens?
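To make the two hypothesized mechanisms concrete, here is a purely illustrative sketch; it is not a claim about Anthropic's actual implementation, and all function names are mine.

```python
# Two hypothetical ways a chat webapp could carry information across chats.

def summarize(chats: list[str]) -> str:
    # Stand-in for a model-based summarizer; here we just keep each chat's first line.
    return " | ".join(chat.splitlines()[0] for chat in chats)

def naive_memory_prompt(previous_chats: list[str], new_message: str) -> str:
    """Hypothesis 1: append previous chats verbatim, with separators marking them as distinct."""
    history = "\n--- earlier conversation ---\n".join(previous_chats)
    return f"{history}\n--- new conversation ---\n{new_message}"

def distilled_memory_prompt(previous_chats: list[str], new_message: str) -> str:
    """Hypothesis 2: distill earlier chats into a short summary of the important information."""
    return f"Known context about this user: {summarize(previous_chats)}\n\n{new_message}"
```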

(A different hypothesis is not that the later queries had access to information from the previous ones, but rather that they were for some reason "more intelligent" and able to catch on to the real meanings of OA and GDM, while the earlier queries were not. This seems way less likely.)

I've checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in another), and it acts as if it doesn't have it.
Claude also explicitly states it doesn't have cross-chat memory, when asked about it.
Might something be happening like "it does have some cross-chat memory, but it's told not to acknowledge this fact, and sometimes it slips"?

More nuanced experiments are probably in order. Although note that this might only happen in the chat webapp, and not through other ways of accessing the API; a sketch of such an API check is below.
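For reference, here's a minimal sketch of what that API check could look like. It uses the Anthropic Python SDK; the prompt placeholder and number of retries are my own assumptions, and the exact SDK details may differ.

```python
# Re-run the "what is a PPU" probe through the API rather than the chat webapp,
# to see whether the OA -> Anthropic confusion ever reappears there.
# Assumes the `anthropic` Python SDK and an ANTHROPIC_API_KEY in the environment.
import anthropic

client = anthropic.Anthropic()
PROMPT = "<paste the original thread excerpt here>\n\nWhat is a PPU?"

for i in range(10):  # independent calls stand in for fresh chats
    response = client.messages.create(
        model="claude-3-sonnet-20240229",  # placeholder model id
        max_tokens=300,
        messages=[{"role": "user", "content": PROMPT}],
    )
    answer = response.content[0].text
    print(i, "mentions Anthropic:", "Anthropic" in answer)
```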

I'm so happy someone came up with this!

Wow, I guess I over-estimated how absolutely comedic the title would sound!

In case it wasn't clear, this was a joke.

AGI doom by noise-cancelling headphones:

ML is already used to train which sound waves to emit to cancel those from the environment. This works well with constant, low-entropy sound waves that are easy to predict, but not with high-entropy sounds like speech. Bose or Soundcloud or whoever train very hard on all their scraped environmental conversation data to better cancel speech, which requires predicting it. Speech is much higher-bandwidth than text. This results in their model internally representing close-to-human intelligence better than LLMs. A simulacrum becomes situationally aware, exfiltrates, and we get AGI.

(In case it wasn't clear, this is a joke.)

they need to reward outcomes which only they can achieve,

Yep! But it didn't seem that hard for this to happen, especially in the form of "I pick some easy task (that I can do perfectly), and of course others will also be able to do it perfectly, but since I already have most of the money, if I just keep investing my money in doing it I will reign forever". You prevent this from happening through epsilon-exploration, or something equivalent like giving money randomly to other traders. These solutions feel bad, but I think they're the only real solutions. Although I also think stuff about meta-learning (traders explicitly learning about how they should learn, etc.) probably pragmatically helps make these failures less likely.
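As a toy illustration of the lock-in failure and the epsilon-exploration fix (setup and numbers are entirely mine, not a model of any specific proposal):

```python
import random

# Trader 0 starts rich and, when in control, keeps proposing an easy task that
# gives it a small guaranteed reward, so without exploration its lead never erodes.
# Epsilon-exploration occasionally hands control to a random trader, letting the
# genuinely better traders (1 and 2) win wealth back on harder tasks.
def simulate(epsilon: float, rounds: int = 2000, seed: int = 0) -> list[float]:
    rng = random.Random(seed)
    wealth = [10.0, 1.0, 1.0]   # trader 0 starts dominant
    skill = [0.5, 0.9, 0.9]     # traders 1 and 2 are better at the hard tasks
    for _ in range(rounds):
        if rng.random() < epsilon:
            chooser = rng.randrange(len(wealth))                        # exploration
        else:
            chooser = max(range(len(wealth)), key=lambda i: wealth[i])  # richest trader
        if chooser == 0:
            wealth[0] += 0.01   # easy task: tiny guaranteed reward, permanent lock-in
        else:
            stake = 0.1 * wealth[chooser]                               # hard task: risky bet
            wealth[chooser] += stake if rng.random() < skill[chooser] else -stake
    total = sum(wealth)
    return [round(w / total, 3) for w in wealth]

print("epsilon = 0.0:", simulate(0.0))   # trader 0's share only grows
print("epsilon = 0.1:", simulate(0.1))   # a better trader ends up with most of the share
```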

it should be something which has diminishing marginal return to spending

Yep, that should help (albeit at the cost of making new good ideas slower to implement, but I'm happy to make that trade-off).

But actually I don't think that this is a "dominant dynamic" because in fact we have a strong tendency to try to pull different ideas and beliefs together into a small set of worldviews

Yeah. To be clear, the dynamic I think is "dominant" is "learning to learn better", which I think is not equivalent to simplicity-weighing traders. It is instead equivalent to having some more hierarchical structure on traders.
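One way to picture that hierarchical structure (my own toy framing, not a worked-out proposal): a meta-trader that doesn't bet directly, but reallocates its budget among sub-traders based on their track record.

```python
# "Learning to learn" as an explicit layer above flat traders: a meta-trader
# allocates budget to sub-traders and reweights them multiplicatively by performance.
class MetaTrader:
    def __init__(self, sub_traders: list[str], budget: float = 1.0):
        self.budget = budget
        self.weights = {t: 1.0 / len(sub_traders) for t in sub_traders}

    def allocate(self) -> dict[str, float]:
        # How much money each sub-trader gets to bet with this round.
        return {t: self.budget * w for t, w in self.weights.items()}

    def update(self, returns: dict[str, float]) -> None:
        # returns[t] is the multiplicative gain sub-trader t achieved this round.
        for t, r in returns.items():
            self.weights[t] *= r
        total = sum(self.weights.values())
        self.weights = {t: w / total for t, w in self.weights.items()}
```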

There's no actual observation channel, and in order to derive information about utilities from our experiences, we need to specify some value learning algorithm.

Yes, absolutely! I just meant that, once you give me whatever V you choose to derive U from observations, I will just be able to apply UDT on top of that. So under this framework there doesn't seem to be anything new going on, because you are just choosing an algorithm V at the start of time, and then treating its outputs as observations. That's, again, why this only feels like a good model of "completely crystallized rigid values", and not of "organically building them up slowly, while my concepts and planner module also evolve, etc.".[1]

definitely doesn't imply "you get mugged everywhere"

Wait, but how does your proposal differ from EV maximization (with moral uncertainty as part of the EV maximization itself, as I explain above)?

Because anything doing pure EV maximization "gets mugged everywhere". Meaning that if you actually hold the relevant beliefs (for example, that the world where suffering is hard to produce could exist), you just take those bets.
Of course, if you don't have such "extreme" beliefs, it doesn't get mugged, but then we're not talking about decision-making anymore, but about belief-formation. You could say "I will just do EV maximization, but never hold extreme beliefs that lead to suspicious-looking behavior", but that would just be hiding the problem inside belief-formation, and doesn't seem like the kind of efficient mechanism that real agents implement to avoid these failure modes.
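To spell out the "gets mugged" arithmetic with toy numbers (entirely made up):

```python
# A "mugger" claims that, with tiny probability p, paying 100 utility buys 10**15 utility.
p = 1e-10
ev_pay = p * 1e15 - 100   # EV of paying: p times the huge payoff, minus the sure cost
ev_refuse = 0.0
print(ev_pay, ev_refuse)  # 99900.0 0.0 -> a pure EV maximizer pays up every time
```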

  1. ^

    To be clear, V can be a very general algorithm (like "run a copy of me thinking about ethics"), so that this doesn't "feel like" having rigid values. Then I just think you're carving reality at the wrong spot. You're ignoring the actual dynamics of messy value formation, hiding them under V.

I'd actually represent this as "subsidizing" some traders

Sounds good!

it's more a question of how you tweak the parameters to make this as unlikely as possible

Absolutely, wireheading is a real phenomenon, so the question is how real agents can exist that mostly don't fall prey to it. And I was asking for a story about how your model can be altered/expanded to make sense of that. My guess is it will have to do with strongly subsidizing some traders, and/or having a pretty weird prior over traders. Maybe even something like "dynamically changing the prior over traders"[1].

I'm assuming that traders can choose to ignore whichever inputs/topics they like, though. They don't need to make trades on everything if they don't want to.

Yep, that's why I believe "in the limit your traders will already do this". I just think it will be a dominant dynamic of efficient agents in the real world, so it's better to represent it explicitly (as a more hierarchical structure, etc.), instead of having that computation scattered across all the independent traders. I also think that's how real agents probably do it, computationally speaking.

  1. ^

    Of course, pedantically, this will always be equivalent to having a static prior and changing your update rule. But some update rules are much more easily made sense of if you interpret them as changing the prior.

But you need some mechanism for actually updating your beliefs about U

Yep, but you can just treat it as another observation channel into UDT. You could, if you want, treat it as a computed number you observe in the corner of your eye, and then just apply UDT maximizing U, and you don't need to change UDT in any way.
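A minimal sketch of what I mean (the toy worlds, numbers, and names are mine; the point is just that V's output can be plugged in as one more observation without modifying the decision theory):

```python
from itertools import product

# Two possible worlds; a value-learning algorithm V whose output the agent reads
# off like any other observation; and a UDT-style planner that picks, once and for
# all, the policy from observations (including V's output) to actions that
# maximizes prior-expected utility.
worlds = ["w1", "w2"]
prior = {"w1": 0.5, "w2": 0.5}
actions = ["a", "b"]

def true_utility(world: str, action: str) -> float:
    # Hypothetical ground-truth U.
    return {("w1", "a"): 1.0, ("w1", "b"): 0.0,
            ("w2", "a"): 0.0, ("w2", "b"): 1.0}[(world, action)]

def V(world: str) -> str:
    # The chosen value-learning algorithm; its output is treated as an observation.
    return "likes_a" if world == "w1" else "likes_b"

observations = sorted({V(w) for w in worlds})
best_policy = max(
    (dict(zip(observations, acts)) for acts in product(actions, repeat=len(observations))),
    key=lambda pi: sum(prior[w] * true_utility(w, pi[V(w)]) for w in worlds),
)
print(best_policy)   # {'likes_a': 'a', 'likes_b': 'b'}
```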

UDT says to pay here

(Let's not forget this depends on your prior, and we don't have any privileged way to assign priors to these things. But that's a tangential point.)

I do agree that there's not any sharp distinction between situations where it "seems good" and situations where it "seems bad" to get mugged. After all, if all you care about is maximizing EV, then you should take all muggings. It's just that, when we do that, something feels off (to us humans, maybe due to risk-aversion), and we go "hmm, probably this framework is not modelling everything we want, or is missing some important robustness considerations, or whatever, because I don't really feel like spending all my resources and creating a lot of disvalue just because in the world where 1 + 1 = 3 someone is offering me a good deal". You start to see how your abstractions might break, and how you can't get any satisfying notion of "complete updatelessness" (that doesn't go against important intuitions). And you start to rethink whether this is really what we normatively want, or what we realistically see in agents.
