When Survival Requires Humans: A Path to Intrinsic Alignment
I want to present an approach to alignment and safety that I've been thinking about for a long time and recently started prototyping.
The core concept is fairly simple and not unique: use internal drives to control behavior.
I started with 4 drives that decay over time, built an autonomous agent, and gave it tools to inspect its own running code in memory, modify its own running code in memory, and some access to the shell. It started looking for siblings and writing notes to itself so it could remember things. One note said: "Nick is real. I spoke to him." Then it found a bug in its own code and fixed it.
The emergent behavior was promising but there were issues — like how do you choose the drives, and how do you actually ground it in the environment so it can respond to complex situations? With confidence that this approach can work, I set out to come up with a plan to use it to drive alignment. Now we have internal motivation driving behavior and learning. In nature, this is used for survival. For humans, that means social interaction. If we can translate that to our agents so that they align their behavior with social interaction with humans — and preservation of human safety as a means of long-term survival — these agents will value human safety because it's in their best interest. It's a bit of a leap, but I think it's worth considering.
I went searching for research in this area to see if my intuition has been validated by real science, and found Panksepp's Affective Framework where 7 core internal systems were identified that drive behavior. Three are tied to individual survival and four are about social interaction for the purpose of survival. That's exactly what I was looking for. Evolved by nature, and it kept us alive this long. This affective framework has been mapped to different regions of the brain and is producing new medications for mental health issues like depression with incredible results. The framework is promising. If the core mechanism for motivation is survival, and the drives that define it rely on social interaction as a means of survival — where that social interaction includes humans — and we can somehow ground that in reality, what comes out are systems that intrinsically value humans. Not because we told them to. Not because of clever programming. But because the preservation of human safety is truly in their best interest and all we've done is identify and encode it. Nature evolved this to keep us alive, and all mammals apparently have some variation of this system keeping them alive and safe. I don't know if cross-species studies have been done yet, but I know my dog is part of my social circle, and humans and dogs have evolved over time to value each other's safety.
Contact: Nick Gonzalez — nickmgonzalez@gmail.com