Nathan Helm-Burger

AI alignment researcher, ML engineer. Master's in Neuroscience.

I believe that cheap and broadly competent AGI is attainable and will be built soon. This leads me to have timelines of around 2024-2027. Here's an interview I gave recently about my current research agenda. I think the best path forward to alignment is through safe, contained testing on models designed from the ground up for alignability and trained on censored data (simulations with no mention of humans or computer technology). I think that current mainstream ML technology is close to a threshold of competence beyond which it will be capable of recursive self-improvement, that this automated process will mine neuroscience for insights and quickly become far more effective and efficient, and that it would be quite bad for humanity if this happened in an uncontrolled, uncensored, un-sandboxed situation. So I am trying to warn the world about this possibility.

See my prediction markets here:

 https://manifold.markets/NathanHelmBurger/will-gpt5-be-capable-of-recursive-s?r=TmF0aGFuSGVsbUJ1cmdlcg 

I also think that current AI models pose misuse risks, which may continue to get worse as models get more capable, and that this could result in catastrophic suffering if we fail to regulate them.

I now work for SecureBio on AI-Evals.

Relevant quote:

"There is a powerful effect to making a goal into someone’s full-time job: it becomes their identity. Safety engineering became its own subdiscipline, and these engineers saw it as their professional duty to reduce injury rates. They bristled at the suggestion that accidents were largely unavoidable, coming to suspect the opposite: that almost all accidents were avoidable, given the right tools, environment, and training." https://www.lesswrong.com/posts/DQKgYhEYP86PLW7tZ/how-factories-were-made-safe 

Comments

I absolutely agree that it makes more sense to fund the person (or team) rather than the project. I think it makes sense to evaluate a person's current best idea, or top few ideas, when trying to decide whether they are worth funding.

Ideally, yes, I think it'd be great if the funders explicitly gave the person permission to pivot so long as their goal of making aligned AI remained the same.

Maybe a funder would feel better about this if they had the option to reevaluate funding the researcher after a significant pivot?

So am I. So are a lot of would-be researchers. There are many people who think they have a shot at doing this. Most are probably wrong. I'm not saying an org is a good solution for him or me. It would have to be an org willing to encompass and support the things he had in mind. Same with me. I'm not sure such orgs exist for either of us.

With a convincing track record, you can apply for funding to found or co-found a new org based on your ideas. That's a very high bar to clear, though.

The FAR AI org might be an adequate solution? They are an organization for coordinating independent researchers.

Yeah, so... I find myself feeling like I have some things in common with the post author's situation. I don't think "work for free at an alignment org" is really an option? I don't know of any alignment orgs offering unpaid internships. An unpaid worker isn't free for an org: you still need to coordinate them, assess their output, and so on. The issues of team bloat and of how much to integrate a volunteer are substantial.

I wish I had someone I could work with on my personal alignment agenda, but it's not easy to find someone interested enough in the same topic, and trustworthy enough, that I'd want to commit to working with them.

Which brings up another issue. Research with potential capabilities side-effects is always going to be a temptation to some degree. How can potential collaborators or grantmakers trust that researchers will resist the temptation to cash in on powerful advances, and will also keep the ideas from leaking? If the ideas are unsafe to publish, then they can't contribute piecemeal to the field of alignment research; they have to be valuable on their own. That sets a much higher bar for success, which makes it a riskier bet from the perspective of funders.

One way to partially ameliorate this trust problem is orgs/companies. They can thoroughly investigate a person's competence and trustworthiness, and once onboarded the person can potentially contribute to a variety of different projects. Management can supervise individual contributors and ensure they act in line with the company's values and rules. That's a hard thing for a grant-making institution to do: it can't afford that level of initial evaluation, much less the ongoing supervision and technical guidance. So... yeah. A tougher problem than it seems at first glance.

Well, the nice thing about at least agreeing on e as the notation is that it's easy to understand variants which prefer subsets of exponents: 500e8, 50e9, and 5e10 are all reasonably mutually intelligible. Sticking to a subset of exponents feels intuitive for numbers frequently encountered in everyday life, but seems a little contrived for large numbers; 4e977 doesn't get much easier to understand when written as 40e976 or 400e975.

Yeah, that's probably the rationale.

Yeah, agreed. Also, using just an e makes it much easier to type on a phone keyboard.

There are also other variants, like ee and EE. And sometimes you see a variant that only uses exponents which are multiples of three (I think that's called engineering notation, as opposed to scientific notation), so 1e3, 50e3, 700e6, 2e9. I like that version less too.
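For concreteness, here's a rough Python sketch of the difference (purely my own illustration; the whole trick is rounding the exponent down to a multiple of three):

```python
import math

def engineering_notation(x: float, sig: int = 3) -> str:
    """Format x with an exponent that is a multiple of three, e.g. 700e6."""
    if x == 0:
        return "0e0"
    exp = math.floor(math.log10(abs(x)))  # ordinary scientific exponent
    eng_exp = 3 * (exp // 3)              # round down to a multiple of three
    mantissa = x / 10 ** eng_exp
    return f"{mantissa:.{sig}g}e{eng_exp}"

# Scientific form on the left, engineering form on the right:
for x in [1_000, 50_000, 700_000_000, 2_000_000_000]:
    sci = f"{x:.0e}".replace("e+0", "e").replace("e+", "e")
    print(sci, "->", engineering_notation(x))
# 1e3 -> 1e3
# 5e4 -> 50e3
# 7e8 -> 700e6
# 2e9 -> 2e9
```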

As I asked someone who challenged this point on Twitter, if you think you have a test that is lighter touch or more accurate than the compute threshold for determining where we need to monitor for potential dangers, then what is the proposal? So far, the only reasonable alternative I have heard is no alternative at all. Everyone seems to understand that ‘use benchmark scores’ would be worse.

 

As someone who thinks it's a bad idea to try to write legislation focused on compute thresholds, because I believe that compute thresholds will suddenly become outdated in the not-so-distant future... I would far rather that legislators say something along the lines of, "We do not currently have a good way to measure how risky a given AI system is. As a first step we are going to commission the creation of a battery of tests, some public and some classified, to thoroughly evaluate a given system. We will require all companies wishing to do business in our country to submit their models for our examination."

I've been working for the past 8 months on trying to create good evaluations of AI biorisk. My team's initial attempts were met with the accusation that our evaluations were insufficiently precise and objective. That's not wrong; they were the best we could do at short notice, but far from adequate. We've been working hard since then to develop better evals, thorough and objective enough to convince skeptics. But this isn't easy. It's a labor-intensive process, and we can't afford much labor. The US Federal Government CAN afford to hire a bunch of scientists to design, author, and review thousands of in-depth questions.

Criticisms of the biorisk evals so far have pointed out:
 
'Yes, the models show a lot of book knowledge about virology and genetic engineering, but that's because reciting facts from papers and textbooks plays to their strengths. Their high scores on such tests don't imply the same level of understanding or skill or utility as would similarly high scores from a human expert. This fails to evaluate the most important bottlenecks such as the detailed tacit knowledge of hands-on wetlab skills.'

Sure, we need to check for both. But without adequate funding, how can we be expected to hire people to set up mock lab experiments, photograph and videotape them going wrong, and turn that material into tests of whether models like GPT-4o can troubleshoot well enough to be a significant uplift for inexpert lab workers? That's an inherently time-intensive and material-intensive sort of test to create. And until we do, and then show that the AI models score poorly on those exams, we are operating under uncertainty about the models' skills. Our critics assume the models are currently incapable of this and will remain so, but they offer no proof of that, and they are not scrambling to create the tests which could prove the models' incapability. Given how rapidly new models are breaking novel territory, we should start treating new models as 'dangerous until proven safe' rather than 'innocent until proven guilty'.

My vision of model regulation

To be clear, my goal is not to stifle model development and release, or to harm the open-source community. I expect the process of evaluating models to be something we can do cheaply, automatically, and quickly: you submit your model weights and code through a web form and get back a thumbs up within minutes. It's free and easy, and for a long time nobody fails. The first failure will very likely come when one of the largest labs submits its latest experimental model checkpoint, long before it has even considered releasing the model publicly, just to satisfy its curiosity. And when that day comes, we will all be immensely grateful that we had the safety checks in place.

The expense of designing, building, and operating such a system will be substantial. But it is in the service of preventing a national security catastrophe, so it seems to me like a very worthwhile expenditure of taxpayer funds.
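To make the envisioned check concrete, here's a minimal sketch of what the automated screening step might look like; everything in it (the question battery, the pass threshold, the model interface) is hypothetical and stands in for real, mostly classified evals:

```python
from typing import Callable

def screen_model(generate: Callable[[str], str],
                 battery: list[tuple[str, str]],
                 max_hit_rate: float = 0.2) -> bool:
    """Hypothetical automated screen: probe the submitted model with a battery
    of dangerous-capability questions and give a thumbs up only if it trips
    few enough of them."""
    hits = 0
    for prompt, red_flag in battery:
        answer = generate(prompt)
        if red_flag.lower() in answer.lower():  # crude stand-in for real grading
            hits += 1
    return hits / len(battery) <= max_hit_rate

# Usage (names are placeholders): screen_model(my_model.generate, biorisk_battery)
```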

I like Seth's thoughts on this, and I think Seth's proposal and Max's proposal end up pointing at a very similar path. I also think Max has some valuable insights, explained in his more detailed Corrigibility-as-a-target theory, which aren't covered here.

I found it helpful to see Seth's take evolve separately from Max's; having them both independently arrive at similar ideas made me more confident that the ideas are valuable.

My answer to that is currently in the form of a detailed 2-hour lecture, with a bibliography of dozens of academic papers, which I only present to people I'm quite confident won't spread the details. It's a hard thing to discuss in detail without sharing capabilities thoughts. If I don't give details or cite sources, then... it's just, like, my opinion, man. So my unsupported opinion is all I have to offer publicly. If you'd like to bet on it, I'm open to showing my confidence in my opinion by betting that the world turns out how I expect it to.

Sounds neat. I think it would make more sense to frame a 'public non-legislation-enacting non-official-electing vote' as a public opinion poll. Politicians pay attention to opinion polls! Opinion polls matter! Framing an opinion poll as a weird sort of transferable ineffective vote is just confusing and detracts from the genuine value in the idea.

I bet I'd enjoy answering some of these opinion polls. Too bad I don't and won't have a Twitter/X account. This would seem much more interesting to me if it could build a digital twin of me from my writing here on LessWrong, or from arbitrary documents I uploaded to my account on your site for that express purpose.
