This post is a not-so-secret analogy for the AI Alignment problem. Via a fictional dialog, Eliezer explores and counters common objections to the Rocket Alignment Problem as approached by the Mathematics of Intentional Rocketry Institute.
MIRI researchers will tell you they're worried that "right now, nobody can tell you how to point your rocket’s nose such that it goes to the moon, nor indeed any prespecified celestial destination."
I don't know if this is a well known argument that has already been responded to. If it is, just delete the post.
An implicit assumption most people make when discussing takeover risk is that any misaligned agent will hold off on enacting takeover plans that have less than a very high probability of success. If the agent has an immediate takeover plan with a 20% chance of success, but waiting a month would allow it to improve its position to the point where the plan it could enact then had a 99% chance of success, it would wait.
This assumption seems to break in multi-polar scenarios where many AI labs are at similar levels of capability. As a toy example: if you have 20 AI labs, each having developed...
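To make the tradeoff concrete, here is a minimal sketch of when waiting stops being worthwhile. The per-rival monthly act-rate r and the independence assumption are my own illustration, not the post's:

```python
# Toy model: an agent can act now with success probability P_NOW, or wait
# a month for P_LATER, during which each of N_RIVALS rival agents
# independently launches its own takeover with probability r (hypothetical).
P_NOW, P_LATER, N_RIVALS = 0.20, 0.99, 19

def value_of_waiting(r: float) -> float:
    """Success probability if the agent waits one month."""
    preempted = 1 - (1 - r) ** N_RIVALS  # chance some rival acts first
    return (1 - preempted) * P_LATER

# Waiting beats acting now only while value_of_waiting(r) > P_NOW,
# i.e. r < 1 - (P_NOW / P_LATER) ** (1 / N_RIVALS).
threshold = 1 - (P_NOW / P_LATER) ** (1 / N_RIVALS)
print(f"waiting pays only if each rival's monthly act-rate is below {threshold:.3f}")
```

Under these toy numbers the threshold comes out near 8%: with 19 similarly capable rivals, even modest per-rival urgency makes the patient strategy a losing bet.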
Epistemic Status: Musing and speculation, but I think there's a real thing here.
When I was a kid, a friend of mine had a tree fort. If you've never seen such a fort, imagine a series of wooden boards secured to a tree, creating a platform about fifteen feet off the ground where you can sit or stand and walk around the tree. This one had a rope ladder we used to get up and down, a length of knotted rope that was tied to the tree at the top and dangled over the edge so that it reached the ground.
Once you were up in the fort, you could pull the ladder up behind you. It was much, much harder to get into the fort without the ladder....
As a less serious (but not fully unserious) citation of that particular site, it also contains an earlier depiction of literally pulling up ladders (as part of a comic based on treating LOTR as though it were a D&D campaign) that shows what can sometimes result: a disruptive shock from the ones stuck on the lower side, in this case via a leap in technology level.
The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.
But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyway, then the discovery probably wasn't very counterfactually impactful.
Alas, nobody seems to have made a list of highly counterfactual scientific discoveries to complement Wikipedia's list of multiple discoveries.
To...
I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.
I don't think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised correctly if the parameter-function map were one-to-one.
It's a kind of prior people like to use a lot, but that doesn't make it a sane choice.
A well-normalised prior for a regular model probably doesn't look very continuous or dif...
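For readers following this exchange: the background result being debated, as I understand it from Watanabe's singular learning theory (notation mine), is the asymptotic expansion of the Bayes free energy,

```latex
% Watanabe's free-energy asymptotics: S_n is the empirical entropy of the
% true distribution, \lambda the RLCT, and m its multiplicity.
F_n = n S_n + \lambda \log n - (m - 1) \log \log n + O_p(1)
```

where $\lambda$ depends only on the model and the true distribution, not on the prior $\varphi(w)$, provided $\varphi$ is smooth, positive, and bounded on the set of true parameters; that positivity requirement is the "nonvanishing condition" referred to above.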
N.B. This is a chapter in a planned book about epistemology. Chapters are not necessarily released in order. If you read this, the most helpful comments would be on things you found confusing, things you felt were missing, threads that were hard to follow or seemed irrelevant, and otherwise mid to high level feedback about the content. When I publish I'll have an editor help me clean up the text further.
In the previous three chapters we broke apart our notions of truth and knowledge by uncovering the fundamental uncertainty contained within them. We then built back up a new understanding of how we're able to know the truth that accounts for our limited access to certainty. And while it's nice to have this better understanding, you might...
I know that you said comments should focus on things that were confusing, so I'll admit to being quite confused.
Thank you for detailing your thoughts. Some differences for me:
I offer not consensus, but my own opinions:
Will AI get takeover capability? When?
0-5 years.
Single ASI or many AGIs?
There will be a first ASI that "rules the world" because its algorithm or architecture is so superior. If there are further ASIs, that will be because the first ASI wants there to be.
Will we solve technical alignment?
Contingent.
Value alignment, intent alignment, or CEV?
For an ASI you need the equivalent of CEV: values complete enough to govern an entire transhuman civilization.
Defense>offense or offense>defense?
Of...
[Setting: a suburban house. The interior of the house takes up most of the stage; on the audience's right, we see a wall in cross-section, and a front porch. Simplicia enters stage left and rings the doorbell.]
Doomimir: [opening the door] Well? What do you want?
Simplicia: I can't stop thinking about our last conversation. It was kind of all over the place. If you're willing, I'd like to continue, but focusing in narrower detail on a couple points I'm still confused about.
Doomimir: And why should I bother tutoring an Earthling in alignment theory? If you didn't get it from the empty string, and you didn't get it from our last discussion, why should I have any hope of you learning this time? And even if you did, what...
Doomimir: No, it wouldn't! Are you retarded?
Simplicia: [apologetically] Well, actually ...
Doomimir: [embarrassed] I'm sorry, Simplicia Optimistovna; I shouldn't have snapped at you like that.
[diplomatically] But I think you've grievously misunderstood what the KL penalty in the RLHF objective is doing. Recall that the Kullback–Leibler divergence $D_{\mathrm{KL}}(P \parallel Q)$ represents how surprised you'd be by data from distribution $P$ that you expected to be from distribution $Q$.
It's asymmetric: it blows up when the data is very unlikely according to $Q$, which amounts to seei...
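A minimal numerical illustration of the asymmetry Doomimir is describing; the distributions here are my own toy example, not from the dialogue:

```python
import numpy as np

# P puts 10% of its mass on an outcome that Q treats as nearly impossible.
P = np.array([0.5, 0.4, 0.1])
Q = np.array([0.5, 0.499999, 1e-6])

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q): expected extra surprisal when data from p is modeled by q."""
    return float(np.sum(p * np.log(p / q)))

print(kl(P, Q))  # ~1.06 nats: data from P looks shocking under Q
print(kl(Q, P))  # ~0.11 nats: data from Q is unsurprising under P
```

The first divergence grows without bound as Q's mass on the third outcome shrinks toward zero, while the second stays small, which is the sense in which the penalty "blows up" in only one direction.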
FWIW it was obvious to me
It seems to me worth trying to slow down AI development to steer successfully around the shoals of extinction and out to utopia.
But I was thinking lately: even if I didn't think there was any chance of extinction risk, it might still be worth prioritizing a lot of care over moving at maximal speed. Because there are many different possible AI futures, and I think there's a good chance that the initial direction affects the long-term path, and different long-term paths go to different places. The systems we build now will shape the next systems, and so forth. If the first human-level-ish AI is brain emulations, I expect a quite different sequence of events than if it is GPT-ish.
People genuinely pushing for AI speed over care (rather than just feeling impotent) apparently think there is negligible risk of bad outcomes, but they are also asking to take the first future to which there is a path. Yet possible futures form a large space, and arguably we are at a rare plateau from which we could climb very different hills and get to much better futures.
I don't know what Katja thinks, but for me at least: I think AI might pose much more lock-in than other technologies. I.e., I expect that we'll have much less of a chance (and perhaps much less time) to redirect course, adapt, learn from trial and error, etc. than we typically do with a new technology. Given this, I think going slower and aiming to get it right on the first try is much more important than it normally is.