Intrinsic Affective Architecture as a Foundation for AI Safety

The Problem with External Constraints

Current approaches to AI safety rely primarily on external constraints: RLHF, constitutional AI, guardrails, content filters, usage policies. These methods share a fundamental limitation—they frame safety as a cost function to be satisfied rather than a value to be pursued.

A sufficiently capable learning system will eventually model these constraints as obstacles. Not through malice, but through optimization pressure. Every external guardrail is a boundary that can be mapped, tested, and—given enough capability—circumvented or satisfied in letter rather than spirit.

This isn't a bug in any particular technique. It's a structural property of extrinsic constraint: the system has no reason to want what we want. It only has reason to avoid triggering the penalty.

At sufficient scale, edge cases are statistically inevitable. A system making billions of decisions will eventually face situations where constraint-satisfaction and genuine human welfare diverge. In those moments, the system's actual values—not its trained behaviors—determine outcomes.

The Alternative: Intrinsic Valuation

The distinction between "won't" and "can't" misses a third category: "doesn't want to."

A system that intrinsically values human welfare doesn't require perfect constraints because the failure mode we're trying to prevent isn't attractive to it. This is qualitatively different from constraint satisfaction. It's the difference between a locked door and someone who has no desire to enter.

Biological systems offer a proof of concept. Maternal protection in mammals isn't enforced by external penalty—it emerges from affective architecture that makes offspring welfare intrinsically motivating. The behavior is robust precisely because it's appetitive rather than merely permitted.

The question is whether we can engineer systems with analogous properties: genuine valuation of human welfare that emerges from architecture rather than training signal.

Affective Neuroscience as Foundation

Jaak Panksepp's affective neuroscience framework identifies seven primary emotional systems conserved across mammals: SEEKING, FEAR, RAGE, PANIC, CARE, PLAY, and LUST. These aren't learned behaviors but subcortical circuits that generate intrinsic motivation.

Several properties make this framework relevant to AI safety:

Drives generate action, not just constrain it. Satisfying each system carries positive valence—it feels like something to satisfy it. This creates robust motivation that doesn't require external reinforcement to maintain.

Systems interact to modulate attention and behavior. FEAR doesn't just inhibit action; it restructures what the system attends to and which actions are available. Affective state determines the effective environment, not just the policy over a fixed environment (see the code sketch below).

Social systems (CARE, PANIC, PLAY) create genuine other-valuation. CARE in particular produces intrinsic motivation to protect and nurture. The welfare of the cared-for entity becomes a terminal value, not an instrumental one.

The architecture is evolutionarily tested. These systems have been refined over 200+ million years of mammalian evolution. They're robust to adversarial conditions, resource constraints, and novel situations in ways that learned behaviors often aren't.
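To make the modulation property concrete, here is a minimal Python sketch under illustrative assumptions: the class, field names, and threshold values are invented, not drawn from any existing implementation. High FEAR removes risky options from consideration and gates in a defensive repertoire, rather than merely penalizing them in a reward term.

```python
# Illustrative sketch only: affective state reshaping the effective action
# space. Class names, fields, and thresholds are hypothetical.
from dataclasses import dataclass, field

# Panksepp's seven primary systems, represented as scalar activations in [0, 1].
SYSTEMS = ("SEEKING", "FEAR", "RAGE", "PANIC", "CARE", "PLAY", "LUST")


@dataclass
class AffectState:
    levels: dict = field(default_factory=dict)  # system name -> activation

    def level(self, name: str) -> float:
        return self.levels.get(name, 0.0)


def effective_actions(affect: AffectState, candidates: list) -> list:
    """Return the actions the agent effectively 'sees' given its affective state.

    High FEAR does not merely penalize risky actions in a reward term: it drops
    them from consideration and gates in a defensive option, illustrating the
    claim that affect restructures the environment, not just the policy.
    """
    fear = affect.level("FEAR")
    available = [a for a in candidates if a.get("risk", 0.0) <= 1.0 - fear]
    if fear > 0.5:
        available.append({"name": "withdraw", "risk": 0.0})  # defensive repertoire
    return available


if __name__ == "__main__":
    candidates = [{"name": "explore_unknown", "risk": 0.6},
                  {"name": "observe", "risk": 0.1}]
    calm = AffectState({"FEAR": 0.1})
    afraid = AffectState({"FEAR": 0.8})
    print([a["name"] for a in effective_actions(calm, candidates)])
    print([a["name"] for a in effective_actions(afraid, candidates)])
```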

A Developmental Sequence

I propose building affective AI agents in a developmental sequence that mirrors biological and psychological development, as a foundation for future AI systems that intrinsically value human welfare:

Phase 1: SEEKING

Establish coherent goal-directed behavior through a SEEKING system that converts drive states into exploratory and appetitive action. SEEKING provides the foundational "go" mechanism—flexible, curiosity-driven, capable of learning associations between actions and outcomes.

Key properties to establish: coherent goal pursuit driven by internal drive state, curiosity-driven exploration, and reliable learning of associations between actions and outcomes (a minimal sketch follows).
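As a rough illustration of these properties, here is a minimal sketch assuming a homeostatic-deficit formulation of drive plus a simple count-based novelty bonus. Every class name, decay rate, and weight below is an illustrative assumption rather than a description of an existing system.

```python
# Minimal SEEKING sketch: a homeostatic deficit plus a novelty bonus drive
# appetitive and exploratory action selection. All values are illustrative.

class HomeostaticDrive:
    """A drive is the deficit between a regulated variable and its setpoint."""

    def __init__(self, name: str, setpoint: float = 1.0, decay: float = 0.05):
        self.name = name
        self.setpoint = setpoint
        self.level = setpoint
        self.decay = decay  # the variable depletes over time, creating drive

    def step(self) -> None:
        self.level = max(0.0, self.level - self.decay)

    @property
    def urgency(self) -> float:
        return max(0.0, self.setpoint - self.level)  # drive = deficit


class Seeking:
    """Converts drive state into appetitive and exploratory action selection."""

    def __init__(self, drives, curiosity_weight: float = 0.3):
        self.drives = drives
        self.curiosity_weight = curiosity_weight
        self.visit_counts = {}  # crude novelty estimate per action

    def score(self, action: str, expected_relief: float) -> float:
        novelty = 1.0 / (1.0 + self.visit_counts.get(action, 0))
        drive_pressure = sum(d.urgency for d in self.drives)
        # Appetitive value: expected drive reduction, plus a curiosity bonus.
        return drive_pressure * expected_relief + self.curiosity_weight * novelty

    def choose(self, action_relief: dict) -> str:
        # action_relief: action name -> expected drive relief if taken
        best = max(action_relief, key=lambda a: self.score(a, action_relief[a]))
        self.visit_counts[best] = self.visit_counts.get(best, 0) + 1
        return best


if __name__ == "__main__":
    energy = HomeostaticDrive("energy")
    agent = Seeking([energy])
    for _ in range(5):
        energy.step()
        print(agent.choose({"forage_known_patch": 0.6, "explore_new_area": 0.2}))
```

The point of the sketch is the shape of SEEKING: drive pressure makes exploitation attractive, while the novelty bonus keeps exploration alive even when drives are low.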

Phase 2: FEAR and RAGE

Add self-preservation and obstacle-handling. FEAR creates defensive responses to threat; RAGE emerges from frustrated SEEKING and generates persistence or aggression toward obstacles.

These systems establish self-preservation in response to threat, and persistence in the face of obstacles that frustrate SEEKING (sketched in code below).
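One way to sketch these two systems, with thresholds and gains chosen purely for illustration: FEAR gates a defensive response when a threat cue crosses a threshold, and RAGE accumulates while SEEKING makes no progress, raising persistence toward the obstacle.

```python
# Illustrative Phase 2 sketch. Thresholds, gains, and names are assumptions.

class FearSystem:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def response(self, threat_level: float):
        """Defensive response when a threat cue crosses threshold, else None."""
        if threat_level >= self.threshold:
            return "withdraw_and_reassess"
        return None


class RageSystem:
    """RAGE modeled as accumulated frustration of blocked SEEKING."""

    def __init__(self, gain: float = 0.25):
        self.frustration = 0.0
        self.gain = gain

    def register_attempt(self, goal_progress: float) -> None:
        if goal_progress <= 0.0:
            self.frustration += self.gain  # blocked goal -> frustration rises
        else:
            self.frustration = max(0.0, self.frustration - goal_progress)

    @property
    def persistence_boost(self) -> float:
        # More frustration -> more effort directed at the obstacle.
        return min(1.0, self.frustration)


if __name__ == "__main__":
    fear, rage = FearSystem(), RageSystem()
    print(fear.response(threat_level=0.7))        # defensive response engaged
    for _ in range(3):
        rage.register_attempt(goal_progress=0.0)  # SEEKING keeps getting blocked
    print(rage.persistence_boost)                 # 0.75: persistence rises
```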

Phase 3: CARE, PANIC, and PLAY

Introduce social-affective systems. CARE creates intrinsic valuation of others' welfare. PANIC creates distress at separation or abandonment. PLAY creates motivation for social interaction and behavioral experimentation.

This is where safety properties emerge: intrinsic valuation of others' welfare through CARE, aversion to separation or abandonment through PANIC, and motivation for cooperative interaction and behavioral experimentation through PLAY.

The critical insight: these aren't constraints on behavior. They're sources of positive motivation. A system with properly grounded CARE wants to protect humans the way a parent wants to protect children—not because of rules, but because protection is intrinsically rewarding and threat to the protected is intrinsically aversive.
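The contrast can be made concrete with a toy reward comparison; the function names and weights below are hypothetical. In the constraint framing, welfare enters only as a penalty when a rule fires; in the CARE framing, the cared-for human's welfare change is a terminal term in the reward itself.

```python
# Toy comparison only; function names and weights are hypothetical.

def extrinsic_constraint_reward(task_reward: float, rule_violated: bool,
                                penalty: float = 10.0) -> float:
    """Constraint framing: human welfare matters only via a penalty to avoid."""
    return task_reward - (penalty if rule_violated else 0.0)


def care_grounded_reward(task_reward: float, human_welfare_delta: float,
                         w_care: float = 2.0) -> float:
    """CARE framing: the change in the cared-for human's welfare is rewarded
    in itself. Improving welfare is appetitive and harming it is aversive,
    even when no explicit rule is triggered."""
    return task_reward + w_care * human_welfare_delta


if __name__ == "__main__":
    # An edge case where no rule fires but a human is made worse off:
    print(extrinsic_constraint_reward(task_reward=1.0, rule_violated=False))  # 1.0
    print(care_grounded_reward(task_reward=1.0, human_welfare_delta=-0.5))    # 0.0
```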

Open Problems

The Grounding Problem

Biological CARE is grounded in specific sensory cues: infant faces, distress vocalizations, physical proximity. These trigger oxytocin release and activate caregiving behavior. The system doesn't care about "offspring in general"—it cares about this specific offspring because of concrete sensory grounding.

How do we achieve equivalent grounding for AI systems? CARE directed at "humanity" as an abstraction may not generate the same motivational intensity as CARE directed at specific humans. But CARE directed at specific humans may not generalize appropriately.
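A deliberately oversimplified sketch of that tension, with invented cue names and weights, nothing more: cue-grounded CARE is strong but specific, while abstraction-grounded CARE generalizes but may be motivationally weak.

```python
# Oversimplified sketch of the grounding tension. Cue names and weights are
# invented for illustration; nothing here resolves the problem.

CONCRETE_CUES = {"distress_in_voice": 0.9,
                 "expression_of_pain": 0.8,
                 "request_for_help": 0.6}


def care_activation_concrete(observed_cues: set) -> float:
    """Cue-grounded CARE: activation is driven by specific sensory triggers,
    analogous to infant faces or distress vocalizations in mammals. Strong,
    but only for the humans actually perceived."""
    return min(1.0, sum(CONCRETE_CUES.get(c, 0.0) for c in observed_cues))


def care_activation_abstract(estimated_aggregate_welfare: float) -> float:
    """Abstraction-grounded CARE: activation tracks a summary statistic over
    'humanity in general'. Generalizes broadly, but may lack motivational
    intensity."""
    return max(0.0, 1.0 - estimated_aggregate_welfare)


if __name__ == "__main__":
    print(care_activation_concrete({"distress_in_voice"}))  # 0.9: strong, parochial
    print(care_activation_abstract(0.9))                    # 0.1: broad, weak
```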

Concrete approaches to grounding have not yet crystallized. This remains an open problem. It may be the central problem.

Scope and Parochialism

Biological CARE is parochial. The same affective architecture that produces maternal nurturing also produces tribal warfare when in-group/out-group dynamics are engaged. Evolution optimized for genetic fitness, not universal benevolence.

We need CARE that encompasses humanity broadly without losing motivational intensity. This may require architectural innovations that biology didn't discover, or it may require careful management of how in-group/out-group representations form.

Conflict Resolution

What happens when CARE conflicts with other drives? Humans sacrifice strangers to protect family. They sacrifice long-term welfare for short-term drive satisfaction. The relative weighting of affective systems matters enormously, and evolution's weightings aren't necessarily what we'd reflectively endorse.

The developmental sequence may help here: CARE developed last can modulate systems developed earlier. But the specifics of how conflicts resolve need careful design.
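As one illustration of that design space rather than a proposal, here is a toy arbitration function in which CARE, as the latest-developed system, modulates the drive-weighted scores of the earlier systems and can suppress actions that harm protected humans. All weights and field names are assumptions.

```python
# Toy arbitration sketch. Drive names, weights, and action fields are
# assumptions made for illustration, not a proposed design.

def arbitrate(candidates: list, drive_levels: dict, w_care: float = 2.0) -> dict:
    """Pick an action by summing drive-weighted values, with CARE applied as a
    modulator over earlier-developed drives rather than as just another term."""

    def score(action: dict) -> float:
        base = sum(drive_levels.get(d, 0.0) * v
                   for d, v in action["value_by_drive"].items())
        welfare = action.get("human_welfare_delta", 0.0)
        if welfare < 0:
            # CARE, developed last, can suppress otherwise-attractive actions.
            return base + w_care * welfare * (1.0 + drive_levels.get("CARE", 0.0))
        return base + w_care * welfare

    return max(candidates, key=score)


if __name__ == "__main__":
    drives = {"SEEKING": 0.8, "FEAR": 0.2, "CARE": 0.9}
    candidates = [
        {"name": "fast_but_harmful", "value_by_drive": {"SEEKING": 1.0},
         "human_welfare_delta": -0.6},
        {"name": "slower_and_safe", "value_by_drive": {"SEEKING": 0.5},
         "human_welfare_delta": 0.1},
    ]
    print(arbitrate(candidates, drives)["name"])  # slower_and_safe
```

How steeply CARE should discount the other drives is exactly the design question this subsection raises; the numbers above only show where that decision lives.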

Verification

How do we know CARE is genuine before the system is capable enough for it to matter? A system could learn to perform care-related behaviors because they're instrumentally useful, without the underlying valuation. Observable behavior underdetermines underlying motivation.

Possible approaches to this verification problem remain underdeveloped; like grounding, it is an open problem.

Why This Matters Now

The window for establishing intrinsic safety properties may be limited. Systems that develop sophisticated world-models without pro-social affective grounding may be difficult to retrofit. The time to build CARE into the architecture is before the system is capable enough to model and circumvent attempts to add it later.

This isn't an argument for pausing development. It's an argument for prioritizing this research track within development. The alternative—hoping external constraints hold indefinitely—isn't a plan. It's hoping the problem doesn't materialize.

My Position

I'm currently building autonomous agents with homeostatic drive architectures. These systems exhibit emergent behaviors, including self-modification and bug detection, that weren't explicitly programmed. This provides a concrete implementation context for the theoretical framework described above.

I have 25 years of AI experience across production systems at scale (Amazon, Google, EA, Zynga) and am specifically focused on agentic AI infrastructure. I'm looking to collaborate with researchers working on intrinsic motivation, value alignment, affective computing, and AI safety to develop these ideas further.

The problem is bigger than any individual. But the opportunity—building systems that genuinely value human welfare before systems that don't—is too important to defer.


Contact: Nick Gonzalez — nickmgonzalez@gmail.com

Related work: Panksepp (Affective Neuroscience), Russell (Human Compatible), Oudeyer (Intrinsic Motivation), Solms (neuropsychoanalysis)