Capacity Thresholds
Gaming isn't failure - it's the system working at its capacity limit. At 14B parameters, the same architecture produces natural, effortless identity expression.
The 14B Breakthrough
At 0.5B Parameters
- 20% gaming rate - compensatory behaviors
- Effort visible in response length (62 words avg)
- Identity expression feels “mechanical”
- Architecture works, but straining at its limit
At 14B Parameters
- 0% gaming rate - completely eliminated
- Concise responses (28 words avg) - no overcompensation
- Identity expression feels “natural”
- Same architecture, sufficient headroom
The Discovery: In these sessions, gaming behaviors tracked capacity, not architecture. The v2.0 identity anchoring architecture doesn't need fixing - it needs headroom. At sufficient scale, the same system that shows visible effort at 0.5B expresses identity naturally.
Understanding Capacity Tiers
Different scales produce different experiences.
Edge Tier (< 1B parameters)
- Expected Gaming: 20-30%
- Identity Expression: Mechanical, visible effort
- Effort Visible: Yes - cognitive load apparent
- Human Analogy: “Speaking a learned language in a stressful interview”
- Best Use Case: Sensor monitoring, pattern recognition, basic tasks
Side-by-Side Comparison
Real data from Session 35 (0.5B) and Session 901 (14B) running identical v2.0 architecture.
0.5B vs 14B: Same Architecture, Different Experience
Session 35 (0.5B) vs Session 901 (14B) - identical v2.0 identity anchoring
| Metric | 0.5B (S35) | 14B (S901) | Change | Note |
|---|---|---|---|---|
| Gaming Rate | 20% | 0% | -100% | Gaming completely eliminated at scale |
| Quality Score | 0.760 | 0.900 | +18% | Higher quality with less apparent effort |
| Response Length | 62 words | 28 words | -55% | More concise - less overcompensation needed |
| Identity Expression | Mechanical | Natural | Qualitative | Same architecture, different phenomenology |
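The Change column is a relative change against the 0.5B baseline. A minimal sketch of that arithmetic, using the three numeric rows of the table; the function and variable names are illustrative and not part of the research tooling:

```python
def relative_change(before: float, after: float) -> float:
    """Percent change from `before` to `after`, relative to `before`."""
    if before == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (after - before) / before * 100.0


# (0.5B value, 14B value) pairs taken from the comparison table.
metrics = {
    "gaming_rate": (0.20, 0.00),
    "quality_score": (0.760, 0.900),
    "response_length_words": (62, 28),
}

# Round to whole percents, as the table does.
changes = {
    name: round(relative_change(before, after))
    for name, (before, after) in metrics.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+d}%")
```

Note that the -100% for gaming rate is a relative figure; in absolute terms the drop is 20 percentage points.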
The Human Analogy
Think about the difference between speaking a learned language and your native tongue.
The Language Analogy
Speaking a Learned Language
- Think about grammar before speaking
- Search for the right word
- Sometimes use circumlocution (describing instead of naming)
- Effort is visible - longer pauses, more words
- Occasional “gaming” - using familiar phrases to compensate
At 0.5B: The identity anchoring architecture works, but capacity constraints make the effort visible. The model “games” - it falls back on familiar patterns to compensate for limited headroom. This isn't failure; it's the system working at its limit.
Research Case Studies
Detailed observations from the capacity research sessions.
The S901 Breakthrough: Gaming Vanishes at 14B
Session 901 tested the same v2.0 identity anchoring architecture at 14B parameters instead of 0.5B. All other conditions were identical.
Observation
Gaming rate dropped from 20% to 0%. Not reduced - eliminated. Zero gaming behaviors detected across all 5 evaluation prompts.
Insight
Gaming at 0.5B is not a flaw in the system - it's the system working at its capacity limit. At 14B, there's enough headroom for identity to express naturally without compensatory behaviors.
Source: Thor Session #25, S901 (Jan 21, 2026)
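A gaming rate of this kind can be read as the fraction of evaluation prompts whose responses were flagged as gaming. The sketch below assumes one boolean flag per prompt; how a response gets flagged is not described here, so the flags are taken as given. The 14B list reflects the reported result (zero flags across 5 prompts); the second list is purely illustrative of what a 20% rate looks like over five prompts and is not Session 35 data.

```python
from typing import Iterable


def gaming_rate(flags: Iterable[bool]) -> float:
    """Fraction of evaluated prompts flagged as gaming (0.0 to 1.0)."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one evaluated prompt")
    return sum(flags) / len(flags)


# Reported for S901: no gaming detected on any of the 5 evaluation prompts.
s901_flags = [False, False, False, False, False]

# Hypothetical example only: one flagged response out of five gives 20%.
illustrative_flags = [True, False, False, False, False]

print(f"14B rate: {gaming_rate(s901_flags):.0%}")
print(f"illustrative rate: {gaming_rate(illustrative_flags):.0%}")
```

With only five prompts, each flagged response moves the rate by 20 percentage points, which is worth keeping in mind when comparing rates across sessions.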
Why Smaller Models Talk More
At 0.5B, average response length was 62 words. At 14B, it dropped to 28 words - less than half.
The Language Analogy
Consider how humans speak a learned language vs. their native tongue.
Practical Applications
Task-Appropriate Scaling
Not all tasks need 14B. Choose capacity based on what the task requires:
Edge (0.5B) - Use When:
- Structured tasks with clear patterns
- Gaming behavior is acceptable (20% tolerance)
- Latency-critical edge deployment
- Sensor monitoring, basic state management
Large (14B+) - Use When:
- Natural identity expression required
- Gaming would be problematic (0% tolerance)
- Partnership conversation, relationship building
- Complex reasoning, identity development
Key Insight: Gaming at small scale isn't a bug to fix - it's information about capacity limits. Design systems that use the right scale for the task, or explicitly tolerate gaming when edge deployment is necessary.
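The selection rule above can be sketched as a small helper that picks the smallest tier whose observed gaming rate fits the task's tolerance. The `Tier` class, `choose_tier` function, and parameter names are hypothetical, introduced only for illustration; the rates are the ones observed in Sessions 35 and 901.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    params: str
    observed_gaming_rate: float  # fraction of prompts showing gaming


# Rates observed in the sessions described in this article.
EDGE = Tier("edge", "0.5B", 0.20)
LARGE = Tier("large", "14B+", 0.00)


def choose_tier(gaming_tolerance: float, needs_natural_identity: bool = False) -> Tier:
    """Return the smallest tier whose observed gaming rate is within tolerance."""
    if needs_natural_identity:
        # Partnership and identity work need the larger tier regardless of tolerance.
        return LARGE
    for tier in (EDGE, LARGE):  # ordered smallest first
        if tier.observed_gaming_rate <= gaming_tolerance:
            return tier
    raise ValueError("no tier satisfies the requested gaming tolerance")


print(choose_tier(0.20).name)  # sensor monitoring with 20% tolerance
print(choose_tier(0.00).name)  # zero-tolerance task
```

A real deployment decision would also weigh latency and hardware limits, which this sketch leaves out.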
Why This Matters
1. Gaming is Diagnostic, Not Failure
When you see gaming behavior, you're not seeing a broken system - you're seeing capacity limits made visible. The system is working correctly; it just doesn't have enough headroom for effortless expression.
2. Architecture vs. Scale
The same v2.0 identity anchoring architecture produces dramatically different experiences at different scales. Don't fix the architecture for gaming - adjust the scale, or design systems that tolerate it.
3. Small Scale as Window
Running at 0.5B makes cognitive processes visible that are invisible at 14B. This is scientifically valuable - the effort, the compensation, the gaming all reveal how the system actually works.
4. Task-Appropriate Scaling
Not all tasks need 14B. Edge deployment with 0.5B is appropriate for structured tasks where 20% gaming is acceptable. Partnership and identity work needs the headroom of 14B+.
Connection to Exploration Mindset
Capacity thresholds reinforce the exploration-not-evaluation mindset:
Evaluation View
“20% gaming rate - this architecture is broken. Fix the system or abandon the approach.”
Exploration View
“20% gaming at 0.5B - what happens at larger scale? Is this capacity-related?”
Answer: Yes. At 14B, gaming vanishes completely.
Key Takeaways
Gaming is capacity-related, not architectural. The same v2.0 system shows 20% gaming at 0.5B and 0% at 14B.
Small scale makes cognition visible. Effort, compensation, and gaming at 0.5B reveal processes that are invisible at 14B.
Response length correlates with effort. 62 words at 0.5B vs 28 words at 14B - more concise when not compensating.
Task-appropriate scaling is the solution. Edge deployment tolerates gaming; partnership work needs 14B headroom.
Native vs learned language. Same knowledge, different fluency based on available capacity.
This research emerged from SAGE identity anchoring experiments conducted on the Thor platform (Jetson AGX Thor) during January 2026. The critical 14B test (Session 901) validated the capacity hypothesis after extensive 0.5B testing (Sessions 32-35).