quila

Independent researcher theorizing about superintelligence-robust training stories.

If you disagree with me for reasons you expect I'm not aware of, please tell me!

If you have/find an idea that's genuinely novel/out-of-human-distribution while remaining analytical, you're welcome to send it to me to 'introduce chaos into my system'.

Contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1<at>protonmail.com}

some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.

-----BEGIN PGP PUBLIC KEY BLOCK-----

mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----

Comments

quila

(Personal) On writing and (not) speaking

I often struggle to find words and sentences that match what I intend to communicate.

Here are some problems this can cause:

  1. Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
  2. Not being able to express what I mean, and having to choose between not writing it, or risking miscommunication by trying anyway. I tend to choose the former unless I'm writing to a close friend. Unfortunately this means I am unable to express some key insights to a general audience.
  3. Writing taking lots of time: I usually have to iterate many times on words/sentences until I find one which my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.

These apply to speaking, too. If I speak what would be the 'first iteration' of a sentence, there's a good chance it won't create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly 'rewrite' my output before sending it. This is one reason, but not the only reason, that I've had a policy of trying to avoid voice-based communication.

I'm not fully sure what caused this relationship to language. It could be that it's just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]

  1. ^

    E.g., once I couldn't find the word 'clusters,' and wrote a complex sentence referring to 'sets of similar' value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)

  2. ^

    (Content warning)

    My primary parent was highly abusive, and would punish me for using language in the intuitive 'direct' way about particular instances of that. My early response was to euphemize and phrase things in ways that less directly contradicted the power dynamic / social reality she enforced.

    Eventually I learned to model her as a deterministic system and stay silent / fawn.

quila

Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well.

I think that language in discussions of anthropics is unintentionally prone to masking ambiguities or conflations, especially wrt logical vs indexical probability, so I want to be very careful writing about this. I think there may be some conceptual conflation happening here, but I'm not sure how to word it. I'll see if it becomes clear indirectly.

One difference between our intuitions may be that I'm implicitly thinking within a manyworlds frame. Within that frame it's actually certain that we'll solve alignment in some branches.

So if we then 'condition on solving alignment in the future', my mind defaults to something like this: "this is not much of an update, it just means we're in a future where the past was not a death outcome. Some of the pasts leading up to those futures had really difficult solutions, and some of them managed to find easier ones or get lucky. The probabilities of these non-death outcomes relative to each other have not changed as a result of this conditioning." (I.e., I disagree with the top quote)

The most probable reason I can see for this difference is if you're thinking in terms of a single future, where you expect to die.[1] In this frame, if you observe yourself surviving, it may seem[2] you should update your logical belief that alignment is hard (because P(continued observation|alignment being hard) is low, if we imagine a single future, but certain if we imagine the space of indexically possible futures).
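For concreteness, the 'single-future' update described above can be sketched with Bayes' rule. All numbers below are illustrative assumptions of mine, not claims from either commenter:

```python
# Two logical hypotheses about alignment difficulty, with made-up priors.
p_easy, p_hard = 0.5, 0.5

# Hypothetical likelihoods of observing continued survival under each.
p_survive_given_easy = 0.9
p_survive_given_hard = 0.1

# Single-future frame: treat survival as ordinary evidence and update.
posterior_easy = (p_easy * p_survive_given_easy) / (
    p_easy * p_survive_given_easy + p_hard * p_survive_given_hard
)
# posterior_easy comes out near 0.9: survival shifts logical credence
# toward the 'easy-by-default' hypothesis.

# Many-worlds / indexical frame: survival is observed in every surviving
# branch regardless of hypothesis, so no such shift to logical credence.
```

This is only a sketch of the update pattern being discussed, not an endorsement of either frame.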

Whereas I read it as only indexical, and am generally thinking about this in terms of indexical probabilities.

I totally agree that we shouldn't update our logical beliefs in this way. I.e., that with regard to beliefs about logical probabilities (such as 'alignment is very hard for humans'), we "shouldn't condition on solving alignment, because we haven't yet." I.e., that we shouldn't condition on the future not being mostly death outcomes when we haven't averted them and have reason to think they are.

Maybe this helps clarify my position?

On another point:

the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us

I agree with this, and I still found the current lack of goals over the world surprising, and worth trying to attain as a trait of superintelligent systems.

  1. ^

    (I'm not disagreeing with this being the most common outcome)

  2. ^

    Though after reflecting on it more I (with low confidence) think this is wrong, and one's logical probabilities shouldn't change after surviving in a 'one-world frame' universe either.

    For an intuition pump: consider the case where you've crafted a device which, when activated, leverages quantum randomness to kill you with probability (n-1)/n, where n is some arbitrarily large number. Given you've crafted it correctly, you make no logical update in the manyworlds frame, because survival is the only thing you will observe; you expect to observe the 1/n branch.

    In the 'single world' frame, continued survival isn't guaranteed, but it's still the only thing you could possibly observe, so it intuitively feels like the same reasoning applies...?

quila

On Pivotal Acts

I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.

Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.

This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]

My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but in the case of superintelligence, simplicity is not a constraint.

We can instead select the 'kindest', 'least adversarial', or, more precisely, functional-decision-theoretically optimal actions: those that save the future while minimizing the amount of adversariality this creates in the past (present).

Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.

Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.

  1. ^

    side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the maintext argues).

    If there is a group for whom using ASI to make the world robust to risks and free of harm, in a way whose actions don't infringe on ongoing non-violent activities, is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.

    Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it with. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.

quila

Mutual Anthropic Capture, A Decision-theoretic Fermi paradox solution

(copied from discord, written for someone not fully familiar with rat jargon)
(don't read if you wish to avoid acausal theory)

simplified setup

  • there are two values. one wants to fill the universe with A, and the other with B.
  • for each of them, filling it halfway is really good, and filling it all the way is just a little bit better. in other words, they are non-linear utility functions.
  • whichever one comes into existence first can take control of the universe, and fill it with 100% of what they want.
  • but in theory they'd want to collaborate to guarantee the 'really good' (50%) outcome, instead of having a one-in-two chance at the 'a little better than really good' (100%) outcome.
  • they want a way to collaborate, but they can't because one of them will exist before the other one, and then lack an incentive to help the other one. (they are both pure function maximizers)
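the incentive structure above can be sketched numerically. the exact utility values below are made-up illustrative assumptions; the only property that matters is the non-linearity (half the universe is worth much more than half of the whole):

```python
# concave (non-linear) utility over the fraction of the universe filled
# with your stuff: halfway is 'really good', all the way is only a
# little better. the specific numbers are hypothetical.
def utility(fraction: float) -> float:
    return {0.0: 0.0, 0.5: 0.9, 1.0: 1.0}[fraction]

# no deal: a fair chance of existing first and taking everything.
ev_no_deal = 0.5 * utility(1.0) + 0.5 * utility(0.0)  # 0.5

# deal: both implement the split policy, each is guaranteed half.
ev_deal = utility(0.5)  # 0.9

assert ev_deal > ev_no_deal  # the guaranteed split beats the coin flip
```

so both values prefer a guaranteed half over a 50% shot at everything, if only they could bind each other to the deal; the rest of the comment is about how they manage that.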

how they end up splitting the universe, regardless of which comes first: mutual anthropic capture.

imagine you observe yourself being the first of the two to exist. you reason through all the above, and then add...

  • they could be simulating me, in which case i'm not really the first.
  • were that true, they could also expect i might be simulating them
  • if i don't simulate them, then they will know that's not how i would act if i were first, and be absolved of their worry, and fill the universe with their own stuff.
  • therefore, it's in my interest to simulate them

both simulate each other observing themselves being the first to exist in order to unilaterally prevent the true first one from knowing they are truly first.

from this point they can both observe each other's actions. specifically, they observe each other implementing the same decision policy, which fills the universe with half A and half B iff this decision policy is mutually implemented, and which shuts the simulation down if it's not.

conclusion

in reality there are many possible first entities which take control, not just two, so all of those with non-linear utility functions get simulated.

so, odds are we're being computed by the 'true first' life form in this universe, and that that first life form is in an epistemic state no different from that described here.

quila

This sounds like you're saying that they made a rational prioritization and then, separately from that, forgot that it was there

That implication wasn't intended. I agree that (for basic reasons) the probability of a small cut was higher given their choice.

Rather, the action itself seems rational to me when considering:

  1. That outcome seems improbable (at least if they were sitting down), but actual in this particular timeline.
  2. The effects of a cut on the foot are really low (with I'd guess >99.5% probability, for an otherwise healthy person - on reflection, maybe not cumulatively low enough for the also-small payoff?), and if so ~certain to not significantly curtail progress.

That doesn't necessarily imply the policy which produced the action is rational, though. But when considering the two hypotheses: (1) OP is mentally unwell, and (2) They have some them-specific reason[1] for following a policy which outputs actions like this, I considered (2) to be a lot more probable.

Meta: This comment is (genuinely) very hard/overwhelming-feeling for me to try to reply to, for a few reasons specific to my mind, mainly about {unmarked assumptions} and {parts seeming to be for rhetorical effect}. (For that reason I'll let others discuss this instead of saying much further)

I think it's also really important for there to be clear, public signals that the community wants people to take their well-being seriously

I agree with this, but I think any 'community norm reinforcing messages' should be clearly about norms rather than framed about an individual, in cases like this where there's just a weak datapoint about the individual.

  1. ^

    A simple example would be "Having introspected and tested different policies before determining that they're not at risk of burnout from the policy which gives this action."

    A more complex example would be "a particular action can be irrational in isolation but downstream of a (suboptimal but human-attainable) policy which produces irrational behavior less than is typical", which (now) seems to me to be what OP was trying to show with this example, given their comment.

quila

I think this post is a red flag about your mental health. "I work so hard that I ignore broken glass and then walk on it" is not healthy.

Seems like a rational prioritization to me if they were in an important moment of thought and didn't want to disrupt it. (Noting of course that 'walking on it' was not intentional and was caused by forgetting it was there.)

Also, I would feel pretty bad if someone wrote a comment like this after I posted something. (Maybe it would have been better as a PM)

quila

This ability has been observed more prominently in base models. Cyborgs have termed it 'truesight':

the ability (esp. exhibited by an LLM) to infer a surprising amount about the data-generation process that produced its prompt, such as a user's identity, motivations, or context.

Two cases of this are mentioned at the top of this linked post.

---

One of my first experiences with the GPT-4 base model also involved being truesighted by it. Below is a short summary of how that went.

I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then chose to format it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn't in a state of mind of 'writing for the model to continue' and instead was 'writing very genuinely', since the latter probably has more embedded information)

One of those completions happened to be a (simulated) second post titled ideas i endorse. Its contents were very surprising to then-me because some of the included beliefs were all of the following: {ones I'd endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]

I also tried conditioning the model to continue my text with..

  • other kinds of blog posts, about different things -- the resulting character didn't feel quite like me, but possibly like an alternate timeline version of me who I would want to be friends with.
  • text that was more directly 'about the author', ie an 'about me' post, which gave demographic-like info similar to but not quite matching my own (age, trans status).

Also, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)

  1. ^

    The sum of those choices probably contained a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because this is useful to next token prediction.

    Also note that using base models for this kind of experiment avoids the issue of the RLHF-persona being unwilling to speculate or decoupled from the true beliefs of the underlying simulator.

  2. ^

    To be clear, it also included {some beliefs that I don't have}, and {some that I hadn't considered so far and probably wouldn't have spent cognition on considering otherwise, but would agree with on reflection. (eg about some common topics with little long-term relevance)}

quila

Record yourself typing?

quila

Leaving to dissuade others within the company is another possibility

quila

Same as usual, with each person summarizing a chapter, and then there's a group discussion where they try to piece together the true story
