
AI safety & alignment researcher


Will read this in detail later when I can, but on first skim -- I've seen you draw that conclusion in earlier comments. Are you assuming you yourself will finally be deanonymized soon? No pressure to answer, of course; it's a pretty personal question, and answering might itself give away a bit or two.

On reflection I somewhat endorse pointing the risk out after discovering it, in the spirit of open collaboration, as you did. It was just really frustrating when all my experiments suddenly broke for no apparent reason. But that's mostly on OpenAI for not announcing the change to their API (other than emails sent to a few people). Apologies for grouching in your direction.

I'm aware of the paper because of the impact it had. I might personally not have chosen to draw their attention to the issue, since the main effect seems to be making some research significantly more difficult, and I haven't heard of any attempts to deliberately exfiltrate weights that this would be preventing.

Interesting! Tough to test at scale, though, or score in any automated way (which is something I'm looking for in my approaches, although I realize you may not be).

  1. Gwern's theories make sense to me. The data was roughly 50/50 on <= 30 vs > 30, so that's where I split it (and I'm only asking the model to pick one of those two options). Sexuality in the dataset is just male/female; they must have added the other options later (35829 male, 24117 female, and 2 blanks which I ignored). Agreed that this is very much a lower bound, also because I applied zero optimization to the system prompt and user prompts. This is 'if you do the simplest possible thing, how good is it?'
  2. No, unfortunately it's all lowercased already in the dataset.
  3. I agree! Dating site data is somewhat easy mode. I compared gender accuracy on the Persuade 2.0 corpus of students writing essays on a fixed topic, which I consider very much hard mode, and it was still 80% accurate. So I do think it's getting some advantage from being in easy mode but not that much. I'll note also that I'm removing a bunch of words that are giveaways for gender, and it only lost 2 percentage points of accuracy. So I do think it's mostly working from implicit cues and distributional differences here rather than easy giveaways. Staab et al (thanks @gwern for pointing that paper out to me) looks more at explicit cues and compares to human investigators looking for explicit cues, so you may find that interesting as well.
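A minimal sketch of the giveaway-word ablation described in point 3, for readers who want to try something similar. The word list, the function names, and the whitespace tokenization are all assumptions for illustration; the comment does not say which words were removed or how.

```python
# Hypothetical giveaway list; the words actually removed are not given above.
GIVEAWAY_WORDS = frozenset({
    "he", "she", "him", "her", "his", "hers",
    "man", "woman", "boyfriend", "girlfriend", "husband", "wife",
})

# Punctuation stripped from each token before checking it against the list.
_PUNCT = ".,!?;:\"'()"


def scrub(text, giveaways=GIVEAWAY_WORDS):
    """Drop giveaway words from already-lowercased text, keep everything else."""
    kept = [tok for tok in text.split() if tok.strip(_PUNCT) not in giveaways]
    return " ".join(kept)


def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)
```

Running the same classifier on the original and the scrubbed text and comparing the two `accuracy` values gives the size of the drop attributable to explicit cues.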

Absolutely! @jozdien recounting those anecdotes was one of the sparks for this research, as was janus showing in the comments that the base model could confidently identify gwern. (I see I've inexplicably failed to thank Arun at the end of my post; need to fix that.)

Interestingly, I was able to easily reproduce the gwern identification using the public model, so it seems clear that these capabilities are not entirely RLHFed away, although they may be somewhat impaired.

Oh thanks, I'd missed that somehow & thought that only the temp mattered for that.

That used to work, but as of March you can only get the pre-logit_bias logprobs back. They didn't announce the change, but it's discussed in the OpenAI forums eg here. I noticed the change when all my code suddenly broke; you can still see remnants of that approach in the code.
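One possible workaround, sketched under assumptions rather than taken from the code mentioned above: if the unbiased top logprobs for the answer position are available as a token-to-logprob mapping, they can be renormalized over just the allowed answer tokens. The function name and the input shape are hypothetical, and this only works when the allowed tokens actually appear among the returned top logprobs.

```python
import math


def renormalize(top_logprobs, allowed):
    """Restrict a token -> logprob mapping to `allowed` tokens and
    rescale so the resulting probabilities sum to 1."""
    present = {tok: top_logprobs[tok] for tok in allowed if tok in top_logprobs}
    if not present:
        raise ValueError("none of the allowed tokens were returned")
    # Subtract the max logprob before exponentiating for numerical stability.
    peak = max(present.values())
    weights = {tok: math.exp(lp - peak) for tok, lp in present.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}
```

For example, raw probabilities of 0.6 for "A" and 0.2 for "B" (with the rest of the mass on other tokens) renormalize to 0.75 and 0.25.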

That certainly seems plausible -- it would be interesting to compare to a base model at some point, although with recent changes to the OpenAI API, I'm not sure if there would be a good way to pull the right token probabilities out.

@Jessica Rumbelow also suggested that that debiasing process could be a reason why there weren't significant score differences between the main model tested, older GPT-3.5, and the newest GPT-4.
