Charlie Steiner

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Sequences

Alignment Hot Take Advent Calendar
Reducing Goodhart
Philosophy Corner

Comments

20% maybe? I'm feeling optimistic today.

Answer by Charlie Steiner

Nah, likelihood of torture is real low. Most likely causes are an accidental sign flip in a reward function, or some sort of deliberate terrorism.

Although an accidental sign flip in a reward function has indeed happened to people in the past, you just need some basic inspection and safeguarding of the reward function (or steering vector, or learned reward model, or dynamic representation of human moral reasoning) to drive the already-low probability of this and related errors down by orders of magnitude. This is something we're on track to have handled.
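
By "basic inspection and safeguarding" I mean something like a unit test on the reward signal before any training run. Rough sketch below; the reward_fn, reference examples, and margin are all made up for illustration:

# Sanity check: before training, confirm the reward signal orders a few
# hand-labeled reference cases the way you intended, so an accidental sign
# flip (or similar wiring error) gets caught immediately.
def check_reward_sign(reward_fn, good_examples, bad_examples, margin=0.0):
    good_scores = [reward_fn(x) for x in good_examples]
    bad_scores = [reward_fn(x) for x in bad_examples]
    worst_good, best_bad = min(good_scores), max(bad_scores)
    assert worst_good > best_bad + margin, (
        f"Reward ordering looks wrong (min good={worst_good:.3f}, "
        f"max bad={best_bad:.3f}) -- possible sign flip?"
    )

# Toy usage with a reward that (correctly) prefers states near zero:
check_reward_sign(lambda x: -abs(x), good_examples=[0.0, 0.1], bad_examples=[5.0, -7.0])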

And even cartoon terrorists probably have better things to do with whatever cutting-edge AI they've somehow gotten hold of.

Answer by Charlie Steiner

A different way of stating the usual Anthropic-esque concept of features that I find useful: features are the things that are getting composed when a neural network is taking advantage of compositionality. This isn't begging the question; it's just that you can't answer it without knowing about the data distribution and the computational strategy of the model after training.

For instance, the reason the neurons aren't always the features, even though it's natural to write the activations (which then get "composed" into the inputs to the next layer) in the neuron basis, is that if your data only lies on a manifold in the space of all possible values, the local coordinates of that manifold may rarely line up with the neuron basis.
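
Here's a toy version of that picture (my own made-up example): two "neurons" whose activations lie near a tilted one-dimensional manifold, so the direction of variation, the natural candidate feature, doesn't line up with either neuron axis.

import numpy as np

# Two "neurons" whose activations lie (noisily) on a 1-D manifold:
# a line tilted 30 degrees away from the neuron axes.
rng = np.random.default_rng(0)
t = rng.normal(size=2000)                                  # coordinate along the manifold
direction = np.array([np.cos(np.pi / 6), np.sin(np.pi / 6)])
activations = np.outer(t, direction) + 0.05 * rng.normal(size=(2000, 2))

# The top principal direction recovers the manifold's local coordinate...
_, _, vt = np.linalg.svd(activations - activations.mean(0), full_matrices=False)
print("top direction of variation:", vt[0])                # roughly [0.87, 0.50]

# ...which is far from being aligned with either neuron basis vector.
print("overlap with neuron 0:", abs(vt[0] @ np.array([1.0, 0.0])))
print("overlap with neuron 1:", abs(vt[0] @ np.array([0.0, 1.0])))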

Wow, that's pessimistic. So in the future you imagine, we could build AIs that promote the good of all humanity, but we just won't, because if a business built that AI it wouldn't make as much money?

Well, one big reason is if they were prevented from doing the things they thought would constitute using their position of power to do good, or were otherwise made to feel that OpenAI wasn't a good environment for them.

I think this gets deflationary if you think about it, though. Yes, you can apply the intentional stance to the thermostat, but (almost) nobody's going to get confused and start thinking the thermostat has more fancy abilities like long-term planning just because you say "it wants to keep the room at the right temperature." Even though you're using a single word "w.a.n.t." for both humans and thermostats, you don't get them mixed up, because your actual representation of what's going on still distinguishes them based on context. There's not just one intentional stance, there's a stance for thermostats and another for humans, and they make different predictions about behavior, even if they're similar enough that you can call them both intentional stances.

If you buy this, then suddenly applying an intentional stance to LLMs buys you a lot less predictive power, because even intentional stances have a ton of little variables in the mental model they come with, which we will naturally fill in as we learn a stance that works well for LLMs.

I think this is a great idea, except that on easy mode "a good specification of values and ethics to follow" means a few pages of text (or even just the prompt "do good things"), while other times "a good specification of values" is a learning procedure that takes input from a broad sample of humanity, has carefully designed mechanisms that influence its generalization behavior in futuristic situations (probably trained on more datasets that had to be painstakingly collected), and has been engineered to work smoothly with the reasoning process and not encourage perverse behavior.

So to sum up so far, the basic idea is to shoot for a specific expected value of something by stochastically combining policies that have expected values above and below the target. The policies to be combined should be picked from some "mostly safe" distribution rather than being whatever policies are closest to the specific target, because the absolute closest policies might involve inner optimization for exactly that target, when we really want "do something reasonable that gets close to the target."
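
In the simplest two-policy case, the mixing step is just (my own toy framing, not the post's notation): if one "mostly safe" policy has expected value below the target and another above it, pick between them with the probability that makes the overall expectation land on the target.

import random

def mix_to_target(v_low, v_high, target):
    # Probability of picking the high-value policy so that
    # p * v_high + (1 - p) * v_low == target (requires v_low <= target <= v_high).
    assert v_low <= target <= v_high
    return (target - v_low) / (v_high - v_low)

def sample_policy(policy_low, policy_high, v_low, v_high, target):
    p = mix_to_target(v_low, v_high, target)
    return policy_high if random.random() < p else policy_low

# Toy usage: aim for an expected value of 3 using policies worth 1 and 6 in expectation.
print(mix_to_target(1.0, 6.0, 3.0))   # 0.4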

And the "aspiration updating" thing is a way to track which policy you think you're shooting for, in a way that you're hoping generalizes decently to cases where you have limited planning ability?

Nice. I tried to do something similar, except making everything leaky with polynomial tails:

y = (y+torch.sqrt(y**2+scale**2)) * (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) / 4

where the first part (y+torch.sqrt(y**2+scale**2)) is a softplus, and the second part (1+(y+threshold)/torch.sqrt((y+threshold)**2+scale**2)) is a leaky cutoff at the value threshold.
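
Fleshed out into something self-contained, that one-liner looks roughly like this (the scale and threshold values are placeholder choices):

import torch

def leaky_poly_activation(y, scale=0.1, threshold=1.0):
    # Softplus-like factor: roughly 2*y for large positive y, decaying
    # polynomially (like scale**2 / (2*|y|)) rather than exponentially for negative y.
    softplus_part = y + torch.sqrt(y ** 2 + scale ** 2)
    # Leaky cutoff factor: rises smoothly from ~0 to ~2 as y + threshold crosses zero.
    cutoff_part = 1 + (y + threshold) / torch.sqrt((y + threshold) ** 2 + scale ** 2)
    # The /4 normalizes the product so the output approaches y for large positive y.
    return softplus_part * cutoff_part / 4

print(leaky_poly_activation(torch.linspace(-3, 3, 7)))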

But I don't think I got such clearly better results, so I'm going to have to read more thoroughly to see what else you were doing that I wasn't :)

I'm actually not familiar with the nitty gritty of the LLM forecasting papers. But I'll happily give you some wild guessing :)

My blind guess is that the "obvious" stuff is already done (e.g. calibrating or fine-tuning single-token outputs on predictions about facts after the date of data collection), but not enough people are doing ensembling over different LLMs to improve calibration.

I also expect a lot of people prompting LLMs to give probabilities in natural language, and that clever people are already combining these with fine-tuning or post-hoc calibration. But I'd bet people aren't doing enough work to aggregate answers from lots of prompting methods, and then tuning the aggregation function based on the data.
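
To be concrete about that last part, the sort of thing I'm imagining is a simple log-odds-space aggregator whose weights get fit on already-resolved questions. Sketch below with entirely made-up data; prompt_probs would come from however many prompting methods you're combining:

import numpy as np

def logit(p):
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

def fit_aggregator(prompt_probs, outcomes, lr=0.1, steps=2000):
    # prompt_probs: (n_questions, n_methods) probabilities from each prompting method.
    # outcomes: (n_questions,) 0/1 resolutions. Learns weights + bias in log-odds
    # space by gradient descent on log loss.
    X = logit(prompt_probs)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        pred = 1 / (1 + np.exp(-(X @ w + b)))   # aggregated probability
        grad = pred - outcomes                   # gradient of log loss w.r.t. the logit
        w -= lr * X.T @ grad / len(outcomes)
        b -= lr * grad.mean()
    return w, b

def aggregate(prompt_probs, w, b):
    return 1 / (1 + np.exp(-(logit(prompt_probs) @ w + b)))

# Made-up example: 200 resolved questions, 3 prompting methods of varying quality.
rng = np.random.default_rng(0)
truth = (rng.random(200) < 0.5).astype(float)
probs = np.stack([np.clip(truth + rng.normal(0, s, 200), 0.01, 0.99)
                  for s in (0.3, 0.5, 0.8)], axis=1)
w, b = fit_aggregator(probs, truth)
print("learned weights:", w.round(2))   # noisier methods should get smaller weights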
