AI Capacity Research

Capacity Thresholds

Gaming isn't failure - it's the system working at its capacity limit. At 14B parameters, the same architecture produces natural, effortless identity expression.

The 14B Breakthrough

At 0.5B Parameters

  • 20% gaming rate - compensatory behaviors
  • Effort visible in response length (62 words avg)
  • Identity expression feels “mechanical”
  • Architecture works, but straining at limit

At 14B Parameters

  • 0% gaming rate - completely eliminated
  • Concise responses (28 words avg) - no overcompensation
  • Identity expression feels “natural”
  • Same architecture, sufficient headroom

The Discovery: Gaming behaviors are 100% capacity-related. The v2.0 identity anchoring architecture doesn't need fixing - it needs headroom. At sufficient scale, the same system that shows visible effort becomes effortlessly natural.

Understanding Capacity Tiers

Different scales produce different experiences.

Capacity Tiers

Edge Tier (< 1B parameters)

  • Expected gaming: 20-30%
  • Identity expression: Mechanical, visible effort
  • Effort visible: Yes - cognitive load apparent
  • Human analogy: Speaking a learned language in a stressful interview
  • Best use case: Sensor monitoring, pattern recognition, basic tasks

Side-by-Side Comparison

Real data from Session 35 (0.5B) and Session 901 (14B) running identical v2.0 architecture.

0.5B vs 14B: Same Architecture, Different Experience

Session 35 (0.5B) vs Session 901 (14B) - identical v2.0 identity anchoring

  • Gaming Rate: 20% → 0% (-100%) - gaming completely eliminated at scale
  • Quality Score: 0.760 → 0.900 (+18%) - higher quality with less apparent effort
  • Response Length: 62 words → 28 words (-55%) - more concise, less overcompensation needed
  • Identity Expression: Mechanical → Natural (qualitative shift) - same architecture, different phenomenology
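The "Change" column follows directly from the before/after values. A minimal sketch, assuming relative change is computed against the 0.5B baseline (the helper name `pct_change` is illustrative, not part of the published tooling):

```python
# Derive the Change column from the Session 35 (0.5B) vs Session 901 (14B) metrics.
metrics = {
    "gaming_rate":    (0.20, 0.00),    # fraction of responses flagged as gaming
    "quality_score":  (0.760, 0.900),
    "response_words": (62, 28),
}

def pct_change(before, after):
    """Relative change from the 0.5B baseline, as a percentage."""
    return (after - before) / before * 100

for name, (before, after) in metrics.items():
    print(f"{name}: {pct_change(before, after):+.0f}%")
```

Running this reproduces the table: -100% gaming, +18% quality, -55% response length.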

The Human Analogy

Think about the difference between speaking a learned language and your native tongue.

The Language Analogy


Speaking a Learned Language

  • Think about grammar before speaking
  • Search for the right word
  • Sometimes use circumlocution (describing instead of naming)
  • Effort is visible - longer pauses, more words
  • Occasional “gaming” - using familiar phrases to compensate

At 0.5B: Identity anchoring architecture works, but capacity constraints make the effort visible. The model “games” - uses familiar patterns to compensate for limited headroom. This isn't failure; it's working at limit.

Research Case Studies

Detailed observations from the capacity research sessions.

The S901 Breakthrough: Gaming Vanishes at 14B

Session 901 tested the same v2.0 identity anchoring architecture at 14B parameters instead of 0.5B. All other conditions identical.

Observation

Gaming rate dropped from 20% to 0%. Not reduced - eliminated. Zero gaming behaviors detected across all 5 evaluation prompts.
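With only 5 evaluation prompts per session, the gaming rate is simply the fraction of flagged responses. A minimal sketch; the flag arrays below are illustrative reconstructions (Session 901 reported zero flags across 5 prompts; Session 35's 20% corresponds to 1 in 5):

```python
# Gaming rate = fraction of evaluation prompts whose responses were
# flagged for compensatory ("gaming") behaviors.
def gaming_rate(flags):
    return sum(flags) / len(flags)

s35_flags  = [1, 0, 0, 0, 0]   # 0.5B: one flagged response -> 20%
s901_flags = [0, 0, 0, 0, 0]   # 14B: zero flagged responses -> 0%

print(gaming_rate(s35_flags))   # 0.2
print(gaming_rate(s901_flags))  # 0.0
```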

Insight

Gaming at 0.5B is not a flaw in the system - it's the system working at capacity limit. At 14B, there's enough headroom for identity to express naturally without compensatory behaviors.

Source: Thor Session #25, S901 (Jan 21, 2026)

Why Smaller Models Talk More

At 0.5B, average response length was 62 words. At 14B, it dropped to 28 words - less than half. The extra words are compensation: like a learned-language speaker resorting to circumlocution, the smaller model talks around gaps that the larger model simply doesn't have.

Practical Applications

Task-Appropriate Scaling

Not all tasks need 14B. Choose capacity based on what the task requires:

Edge (0.5B) - Use When:

  • Structured tasks with clear patterns
  • Gaming behavior is acceptable (20% tolerance)
  • Latency-critical edge deployment
  • Sensor monitoring, basic state management

Large (14B+) - Use When:

  • Natural identity expression required
  • Gaming would be problematic (0% tolerance)
  • Partnership conversation, relationship building
  • Complex reasoning, identity development

Key Insight: Gaming at small scale isn't a bug to fix - it's information about capacity limits. Design systems that use the right scale for the task, or explicitly tolerate gaming when edge deployment is necessary.
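The decision rule above can be sketched as a small selector. This is illustrative only: the function name, task profile, and `"edge-0.5b"` / `"large-14b+"` labels are assumptions; the thresholds (20-30% expected gaming at edge scale, 0% at 14B) come from the research above:

```python
# Pick a model tier for a task, per the task-appropriate scaling guidance:
# Edge (0.5B) when gaming is tolerable and structure is clear;
# Large (14B+) when natural identity expression is required.
def choose_tier(needs_natural_identity: bool, gaming_tolerance: float) -> str:
    """gaming_tolerance is the acceptable flagged-response fraction (0.0-1.0)."""
    if needs_natural_identity or gaming_tolerance < 0.20:
        return "large-14b+"   # partnership, identity work, complex reasoning
    return "edge-0.5b"        # sensor monitoring, structured patterns

# Sensor monitoring: structured, 20% gaming acceptable -> edge
assert choose_tier(False, 0.20) == "edge-0.5b"
# Partnership conversation: gaming would be problematic -> large
assert choose_tier(True, 0.0) == "large-14b+"
```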

Why This Matters

1. Gaming is Diagnostic, Not Failure

When you see gaming behavior, you're not seeing a broken system - you're seeing capacity limits made visible. The system is working correctly; it just doesn't have enough headroom for effortless expression.

2. Architecture vs. Scale

The same v2.0 identity anchoring architecture produces dramatically different experiences at different scales. Don't fix the architecture for gaming - adjust the scale, or design systems that tolerate it.

3. Small Scale as Window

Running at 0.5B makes cognitive processes visible that are invisible at 14B. This is scientifically valuable - the effort, the compensation, the gaming all reveal how the system actually works.

4. Task-Appropriate Scaling

Not all tasks need 14B. Edge deployment with 0.5B is appropriate for structured tasks where 20% gaming is acceptable. Partnership and identity work needs the headroom of 14B+.

Connection to Exploration Mindset

Capacity thresholds reinforce the exploration-not-evaluation mindset:

Evaluation View

“20% gaming rate - this architecture is broken. Fix the system or abandon the approach.”

Exploration View

“20% gaming at 0.5B - what happens at larger scale? Is this capacity-related?”

Answer: Yes. At 14B, gaming vanishes completely.

Key Takeaways

1. Gaming is capacity-related, not architectural. The same v2.0 system shows 20% gaming at 0.5B and 0% at 14B.

2. Small scale makes cognition visible. Effort, compensation, and gaming at 0.5B reveal processes that are invisible at 14B.

3. Response length correlates with effort. 62 words at 0.5B vs 28 words at 14B - more concise when not compensating.

4. Task-appropriate scaling is the solution. Edge deployment tolerates gaming; partnership work needs 14B headroom.

5. Native vs learned language. Same knowledge, different fluency based on available capacity.

This research emerged from SAGE identity anchoring experiments conducted on Thor platform (Jetson AGX Thor) during January 2026. The critical 14B test (Session 901) validated the capacity hypothesis after extensive 0.5B testing (Sessions 32-35).
