William_S4dΩ681558
27
I worked at OpenAI for three years, from 2021-2024 on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
Pretending not to see when a rule you've set is being violated can be optimal policy in parenting sometimes (and I bet it generalizes). Example: suppose you have a toddler and a "rule" that food only stays in the kitchen. The motivation is that each time food is brought into the living room there is a small chance of an accident resulting in a permanent stain. There's a cost to enforcing the rule, as the toddler will put up a fight. Suppose that one night you feel really tired and the cost feels particularly high. If you enforce the rule, it will be much more painful than it's worth in that moment (meaning, fully discounting future consequences). If you fail to enforce the rule, you undermine your authority, which results in your toddler fighting future enforcement (of this and possibly all other rules!) much harder, as he realizes that the rule is in fact negotiable / flexible. However, you have a third choice, which is to credibly pretend to not see that he's doing it. It's true that this will undermine your perceived competence, as an authority, somewhat. However, it does not undermine the perception that the rule is to be fully enforced if only you noticed the violation. You get to "skip" a particularly costly enforcement, without taking steps back that compromise future enforcement much. I bet this happens sometimes in classrooms (re: disruptive students) and prisons (re: troublesome prisoners) and regulation (re: companies that operate in legally aggressive ways). Of course, this stops working and becomes a farce once the pretense is clearly visible. Once your toddler knows that sometimes you pretend not to see things to avoid a fight, the benefit totally goes away. So it must be used judiciously and artfully.
I wish there were more discussion posts on LessWrong. Right now it feels like it weakly if not moderately violates some sort of cultural norm to publish a discussion post (similar but to a lesser extent on the Shortform). Something low effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. What do you guys think?" It seems to me like something we should encourage though. Here's how I'm thinking about it. Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll bring it up at a meetup. Maybe you'll post about it in a private Slack group. But if it's appropriate in those contexts, why shouldn't it be appropriate on LessWrong? Why not benefit from having it be visible to more people? The more eyes you get on it, the better the chance someone has something helpful, insightful, or just generally useful to contribute. The big downside I see is that it would screw up the post feed. Like when you go to lesswrong.com and see the list of posts, you don't want that list to have a bunch of low quality discussion posts you're not interested in. You don't want to spend time and energy sifting through the noise to find the signal. But this is easily solved with filters. Authors could mark/categorize/tag their posts as being a low-effort discussion post, and people who don't want to see such posts in their feed can apply a filter to filter these discussion posts out. Context: I was listening to the Bayesian Conspiracy podcast's episode on LessOnline. Hearing them talk about the sorts of discussions they envision happening there made me think about why that sort of thing doesn't happen more on LessWrong. Like, whatever you'd say to the group of people you're hanging out with at LessOnline, why not publish a quick discussion post about it on LessWrong?
habryka4d4720
7
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company now was performing assassinations of U.S. citizens.  Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
Dalcy4d447
1
Thoughtdump on why I'm interested in computational mechanics:
* one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. 'discover' fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
* ... but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
* re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction in real world noisy data. CSSR is an example of a reconstruction algorithm. apparently people did compmech stuff on real-world data, don't know how good, but effort-wise far too little invested compared to theory work
* would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
* tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i'm thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it's gonna be unwieldy large. but, to shift the thread in the direction of bright-eyed theorizing ...
* the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream -> tree -> markov model -> stack automata -> ... ?)
* this ... sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
* haha but alas, (almost) no development afaik since the original paper. seems cool
* and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
* eg crutchfield talks a lot about developing a right notion of information flow - obvious usefulness in eg formalizing boundaries?
* many other information measures from compmech with suggestive semantics: cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.

Popular Comments

Recent Discussion

Linch14m20

I can see some arguments in your direction but would tentatively guess the opposite. 

In the late 19th century, two researchers meet to discuss their differing views on the existential risk posed by future Uncontrollable Super-Powerful Explosives.

  • Catastrophist: I predict that one day, not too far in the future, we will find a way to unlock a qualitatively new kind of explosive power. This explosive will represent a fundamental break with what has come before. It will be so much more powerful than any other explosive that whoever gets to this technology first might be in a position to gain a DSA over any opposition. Also, the governance and military strategies that we were using to prevent wars or win them will be fundamentally unable to control this new technology, so we'll have to reinvent everything on the fly or die in
...

Maybe we have different definitions of DSA: I was thinking of it in terms of 'resistance is futile' and you can dictate whatever terms you want because you have overwhelming advantage, not that you could eventually after a struggle win a difficult war by forcing your opponent to surrender and accept unfavorable terms.

If, say, the US of 1965 was dumped into post-WW2 Earth, it would have the ability to dictate whatever terms it wanted because it would be able to launch hundreds of ICBMs at enemy cities at will. If the real US of 1949 had started a war against t...

This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.

TL;DR

Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating researchers for Sparse Autoencoders (SAEs) by hosting models, feature dashboards, data visualizations, tooling, and more.

Important Links

Neuronpedia has received 1 year of funding from LTFF. Johnny Lin is full-time on engineering, design, and product, while Joseph Bloom is supporting with...

I'm interested in using the SAEs and auto-interp GPT-3.5-Turbo feature explanations for RES-JB for some experiments. Is there a way to download this data?

A couple years ago, I had a great conversation at a research retreat about the cool things we could do if only we had safe, reliable amnestic drugs - i.e. drugs which would allow us to act more-or-less normally for some time, but not remember it at all later on. And then nothing came of that conversation, because as far as any of us knew such drugs were science fiction.

… so yesterday when I read Eric Neyman’s fun post My hour of memoryless lucidity, I was pretty surprised to learn that what sounded like a pretty ideal amnestic drug was used in routine surgery. A little googling suggested that the drug was probably a benzodiazepine (think valium). Which means it’s not only a great amnestic, it’s also apparently one...

4Algon6h
@habryka this comment has an anomalous amount of karma. It showed up on popular comments, I think, and I'm wondering if people liked the comment when they saw it there, which led to a feedback loop of more eyeballs on the comment, more likes, more eyeballs, etc. If so, is that the intended behaviour of the popular comments feature? It seems like it shouldn't be.
habryka41m40

Yeah, seems like a kinda bad feedback loop. It doesn't seem to happen usually: the comments I've seen upvoted in that section don't usually get this many upvotes, especially when they are this short.

I don't have a great solution. We could do something that's more clever and algorithmic, which doesn't seem crazy but I am also hesitant to do because it's a lot of work and also I like more straightforward and simple algorithms for transparency reasons.

Introduction

A recent popular tweet did a "math magic trick", and I want to explain why it works and use that as an excuse to talk about cool math (functional analysis). The tweet in question:

Image

This is a cute magic trick, and like any good trick they nonchalantly gloss over the most important step. Did you spot it? Did you notice your confusion?

Here's the key question: Why did they switch from a differential equation to an integral equation? If you can use $(1-x)^{-1}=1+x+x^2+\dots$ when $x$ is the integral operator, why not use it when $x$ is the derivative operator?

Well, let's try it, writing $D$ for the derivative:

So now you may be disappointed, but relieved: yes, this version fails, but at least it fails-safe, giving you the trivial solution, right?

But no, actually  can fail catastrophically, which we can see if we try a nonhomogeneous equation...

2DaemonicSigil12h
Heh, sure.
3notfnofn4h
Very nice! Notice that if you write $r=j-k$, $I$ as $D^{-1}$, and play around with binomial coefficients a bit, we can rewrite this as: $D^{-k}(fp)=\sum_{r=0}^{\infty}\binom{-k}{r}(D^{-k-r}f)(D^{r}p)$, which holds for $k<0$ as well, in which case it becomes the derivative product rule. This also matches the formal power series expansion of $(x+y)^{-k}$, which one can motivate directly. (By the way, how do you spoiler tag?)

Oh, very cool, thanks! Spoiler tag in markdown is:

:::spoiler
text here
:::
2Robert_AIZI14h
Ah sorry, I skipped over that derivation! Here's how we'd approach this from first principles: to solve $f=Df$, we know we want to use the $(1-x)^{-1}=1+x+x^2+\dots$ trick, but now know that we need $x=I$ instead of $x=D$. So that's why we want to switch to an integral equation, and we get $f=Df$, so $If=IDf=f-f(0)$, where the final equality is the fundamental theorem of calculus. Then we rearrange: $f-If=f(0)$, i.e. $(1-I)f=f(0)$, and solve from there using the $(1-I)^{-1}=1+I+I^2+\dots$ trick! What's nice about this is it shows exactly how the initial condition of the DE shows up.
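
To make the last step concrete, here is a minimal Python sketch (not from the original thread) that applies the partial sums of 1 + I + I^2 + ... to the constant f(0) and compares the result with f(0)·e^x, the known solution of f = Df. The polynomial-as-coefficient-list representation and the helper names `integrate`, `neumann_partial_sum`, and `evaluate` are illustrative assumptions, and I is taken to mean integration from 0 to x.

```python
from fractions import Fraction
import math


def integrate(coeffs):
    """Apply I (integration from 0 to x) to a polynomial.

    coeffs[n] is the coefficient of x**n; the result has no constant term.
    """
    return [Fraction(0)] + [c / (n + 1) for n, c in enumerate(coeffs)]


def neumann_partial_sum(f0, terms):
    """Coefficients of (1 + I + I^2 + ... + I^(terms-1)) applied to the constant f0."""
    total = []
    current = [Fraction(f0)]  # I^0 applied to f0 is just the constant f0
    for _ in range(terms):
        # Pad the running total so it can hold the current term.
        total += [Fraction(0)] * (len(current) - len(total))
        for n, c in enumerate(current):
            total[n] += c
        current = integrate(current)  # move on to the next power of I
    return total


def evaluate(coeffs, x):
    """Evaluate the polynomial at a floating-point x."""
    return sum(float(c) * x**n for n, c in enumerate(coeffs))


# Hypothetical initial condition f(0) = 2, truncated after 20 terms.
coeffs = neumann_partial_sum(f0=2, terms=20)
approx = evaluate(coeffs, 1.0)
exact = 2 * math.exp(1.0)
print(approx, exact)
```

Each application of I turns x^n into x^(n+1)/(n+1), so the n-th term of the series is f(0)·x^n/n! and the partial sums are the Taylor polynomials of f(0)·e^x. Exact fractions are used for the coefficients so that only the final evaluation involves floating point.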
Algon2h22

I think you should write it. It sounds funny, and a bunch of people have been calling out what they see as bad arguments that alignment is hard lately, e.g. TurnTrout, QuintinPope, ZackMDavis, and karma-wise they did fairly well.


I didn’t use to be, but now I’m part of the 2% of U.S. households without a television. With its near ubiquity, why reject this technology?


The Beginning of my Disillusionment

Neil Postman’s book Amusing Ourselves to Death radically changed my perspective on television and its place in our culture. Here’s one illuminating passage:

We are no longer fascinated or perplexed by [TV’s] machinery. We do not tell stories of its wonders. We do not confine our TV sets to special rooms. We do not doubt the reality of what we see on TV [and] are largely unaware of the special angle of vision it affords. Even the question of how television affects us has receded into the background. The question itself may strike some of us as strange, as if one were

...

To make an analogy to diet, you essentially replaced a sugar fix from eating Snickers bars with eating strawberries. Gradation matters!

I had a similar slide with my technologies, as I explained in the post. I eventually landed on reading books. But even that became a form of intellectual procrastination as I wrote in my latest LW post.

If you’ve ever been to Amsterdam, you’ve probably visited, or at least heard about, the famous cookie store that sells only one cookie. I mean, not a piece, but a single flavor.

I’m talking about Van Stapele Koekmakerij of course, where you can get one of the world's most delicious chocolate chip cookies. Unless you arrive at opening time, you’re likely to find a long queue extending from the store’s doorstep down the street. When I visited the city a few years ago, I watched the sensation myself: a nervous crowd waited as the rumor of ‘out of stock’ cookies spread along the line.

Van Stapele Koekmakerij - Cookie Shop in Amsterdam
Owner Vera Van Stapele with fresh-baked cookies, via store website

The store, despite becoming a landmark for tourists, stands for an idea that seems to...

Fooming Shoggoths Dance Concert

June 1st at LessOnline

After their debut album I Have Been A Good Bing, the Fooming Shoggoths are performing at the LessOnline festival. They'll be unveiling several previously unpublished tracks, such as "Nothing is Mere", feat. Richard Feynman.

Ticket prices rise by $100 on May 13th