DanielFilan

What exactly is GPT-3's base objective?

I continue to think that the Risks from Learned Optimization terminology is really good, for the specific case that it's talking about. The problem is just that it's not general enough to handle all possible ways of training a model using machine learning.

GPT-3 was trained using self-supervised learning, which I would have thought was a pretty standard way of training a model using machine learning. What training scenarios do you think the Risks from Learned Optimization terminology can handle, and what's the difference between those and the way GPT-3 was trained?

AMA: Paul Christiano, alignment researcher

What changed your mind about Chaitin's constant?

Emergent modularity and safety

It's true! Altho I think of putting something up on arXiv as a somewhat lower bar than 'publication' - that paper has a bit of work left.

Welcome & FAQ!

I really like the art!

Finite Factored Sets: Orthogonality and Time

OK I think this is a typo, from the proof of prop 10 where you deal with condition 5:

Thus .

I think this should be .

Finite Factored Sets: Orthogonality and Time

From def 16:

... if for all

Should I take this to mean "if for all and "?

[EDIT: no, I shouldn't, since and are both subsets of ]

A simple example of conditional orthogonality in finite factored sets

Seems right. I still think it's funky that X_1 and X_2 are conditionally non-orthogonal even when the range of the variables is unbounded.

AXRP Episode 9 - Finite Factored Sets with Scott Garrabrant

I'm glad to hear that the podcast is useful for people :)

Knowledge is not just mutual information

Seems like the solution should perhaps be to take 'the system' to be only the 'controllable' physical variables, or those variables that are relevant for 'consequential' behaviour? Hopefully, if one can provide good definitions for these, that will provide a foundation for the abstractions that let us distinguish between 'high-level' and 'low-level' behaviour.

I think you might be misunderstanding this? My take is that "return" is just the discounted sum of future rewards, which you can (in an idealized setting) think of as a mathematical function of the future trajectory of the system. So it's still well-defined even when you aren't updating weights.