Empirical Validation
Input: $ARGUMENTS (a plan, strategy, or set of claims to validate)
Overview
Problem: Story coherence is necessary but insufficient for plan quality. A coherent narrative can mask a poor plan (the “Hollywood problem”: heist movies have perfectly coherent plots, yet real heists fail).
Solution: Add an empirical validation step that tests the plan against external reality wherever possible.
Depth Scaling
Default: 2x. Parse depth from $ARGUMENTS if specified (e.g., “/emv 4x [input]”).
| Depth | Min Validation Tests | Min Independent Checks | Min Edge Cases Tested | Min Confidence Calibrations |
|---|---|---|---|---|
| 1x | 3 | 1 | 1 | 1 |
| 2x | 5 | 2 | 2 | 2 |
| 4x | 8 | 3 | 4 | 3 |
| 8x | 12 | 5 | 6 | 5 |
| 16x | 18 | 7 | 10 | 7 |
These are floors. Go deeper where insight is dense. Compress where it’s not.
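As a sketch of the depth parsing described above (the `DEPTH_FLOORS` table and `parse_depth` helper are illustrative names, not existing tooling):

```python
import re

# Floors from the table above, keyed by depth multiplier.
DEPTH_FLOORS = {
    1:  {"tests": 3,  "checks": 1, "edge_cases": 1,  "calibrations": 1},
    2:  {"tests": 5,  "checks": 2, "edge_cases": 2,  "calibrations": 2},
    4:  {"tests": 8,  "checks": 3, "edge_cases": 4,  "calibrations": 3},
    8:  {"tests": 12, "checks": 5, "edge_cases": 6,  "calibrations": 5},
    16: {"tests": 18, "checks": 7, "edge_cases": 10, "calibrations": 7},
}

def parse_depth(arguments: str, default: int = 2) -> int:
    """Extract a leading depth like '4x' from $ARGUMENTS; fall back to 2x."""
    match = re.match(r"\s*(\d+)x\b", arguments)
    depth = int(match.group(1)) if match else default
    return depth if depth in DEPTH_FLOORS else default
```

For example, `parse_depth("4x validate launch plan")` returns 4 and selects the 8-test floor.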
Step 1: Identify Testable Predictions
For the plan/claims, extract testable predictions:
TESTABLE PREDICTIONS
====================
Plan/Claim: [the plan or claim being validated]
Predictions if this plan is CORRECT:
1. [Observable outcome 1] within [timeframe]
2. [Observable outcome 2] within [timeframe]
3. [Observable outcome 3] within [timeframe]
Predictions if this plan is WRONG:
1. [Observable failure 1] would indicate [what's wrong]
2. [Observable failure 2] would indicate [what's wrong]
3. [Observable failure 3] would indicate [what's wrong]
CRUX PREDICTIONS (would change the plan if different):
1. [Most important prediction to test]
2. [Second most important]
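Since these predictions feed the log in Step 5 and the Calibration Review later, it can help to hold each one as structured data. A minimal sketch; the `Prediction` shape is hypothetical, with fields mirroring the template above:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    """One testable prediction extracted from a plan or claim."""
    statement: str           # observable outcome, specific and measurable
    timeframe: str           # when we'll know, e.g. "within 2 weeks"
    if_wrong_indicates: str  # what a miss would reveal about the plan
    is_crux: bool = False    # would a different result change the plan?
    confidence: float = 0.5  # subjective probability, 0.0 to 1.0
```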
Step 2: Design Minimum Viable Test
Find the smallest test that would validate or invalidate key assumptions:
MINIMUM VIABLE TEST
===================
Key assumption to test: [the most important uncertain element]
Test design:
- What to do: [specific action]
- Resources needed: [minimal resources]
- Time required: [estimate]
- Success indicator: [what shows assumption is correct]
- Failure indicator: [what shows assumption is wrong]
- Confidence level: [what % confidence does this test provide]
Can this test be run BEFORE full commitment? [Y/N]
If NO, why not: [reason]
Alternative smaller test: [if applicable]
Step 3: Pre-Test vs Post-Hoc Decision
| Situation | Recommendation |
|---|---|
| Test is cheap, can run before commitment | RUN TEST FIRST |
| Test is expensive but commitment is more expensive | Consider test anyway |
| Cannot test until after commitment | Design post-commitment checkpoints |
| Plan is reversible | Act, then test via results |
| Plan is irreversible | Maximum pre-testing warranted |
PRE/POST DECISION
=================
Action reversibility: [REVERSIBLE / PARTIALLY / IRREVERSIBLE]
Test cost vs action cost: [TEST CHEAPER / SIMILAR / ACTION CHEAPER]
Information available pre-action: [HIGH / MEDIUM / LOW]
Recommendation: [TEST FIRST / ACT THEN TEST / SET CHECKPOINTS]
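The table above can be read as an explicit decision rule. A sketch, with illustrative names (not an existing API):

```python
def pre_post_decision(reversibility: str, test_cheaper_than_action: bool,
                      testable_before_commitment: bool) -> str:
    """Map the Step 3 table to a recommendation.

    reversibility is 'REVERSIBLE', 'PARTIALLY', or 'IRREVERSIBLE'.
    """
    if not testable_before_commitment:
        return "SET CHECKPOINTS"  # design post-commitment checkpoints
    if reversibility == "IRREVERSIBLE":
        return "TEST FIRST"       # maximum pre-testing warranted
    if test_cheaper_than_action:
        return "TEST FIRST"       # cheap test before commitment
    if reversibility == "REVERSIBLE":
        return "ACT THEN TEST"    # act, then test via results
    return "TEST FIRST"           # expensive test, cheaper than a failed commitment
```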
Step 4: Run Test or Design Checkpoints
If Testing First
TEST EXECUTION PLAN
===================
1. [Specific step 1]
2. [Specific step 2]
3. [Measure result]
4. [Compare to prediction]
If prediction confirmed: Proceed with plan
If prediction disconfirmed: [Revise plan OR abandon OR gather more info]
If Acting Then Testing
CHECKPOINT DESIGN
=================
Checkpoint 1: [After what milestone?]
- What to measure: [observable]
- Expected if on track: [prediction]
- Expected if off track: [warning sign]
- Decision rule: [continue / pivot / stop]
Checkpoint 2: [After what milestone?]
- [same format]
Checkpoint 3: [After what milestone?]
- [same format]
Step 5: Log Predictions for Tracking
For calibration over time, log predictions and outcomes:
PREDICTION LOG ENTRY
====================
Date: [today]
Plan/Claim: [summary]
Prediction: [specific, measurable prediction]
Confidence: [0-100%]
Timeframe: [when we'll know]
Test method: [how we'll verify]
Status: [PENDING / CONFIRMED / DISCONFIRMED / MODIFIED]
Outcome: [to be filled when known]
Lessons: [to be filled when known]
Save prediction logs to: library/predictions/[date]_[topic-slug].md
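A minimal sketch of a log writer that follows this path convention (`log_prediction` and its parameters are hypothetical):

```python
from datetime import date
from pathlib import Path

def log_prediction(topic_slug: str, plan: str, prediction: str,
                   confidence_pct: int, timeframe: str, test_method: str) -> Path:
    """Append a PENDING entry to library/predictions/[date]_[topic-slug].md."""
    today = date.today().isoformat()
    path = Path("library/predictions") / f"{today}_{topic_slug}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = (
        f"Date: {today}\n"
        f"Plan/Claim: {plan}\n"
        f"Prediction: {prediction}\n"
        f"Confidence: {confidence_pct}%\n"
        f"Timeframe: {timeframe}\n"
        f"Test method: {test_method}\n"
        "Status: PENDING\n"
        "Outcome: [to be filled when known]\n"
        "Lessons: [to be filled when known]\n\n"
    )
    with path.open("a", encoding="utf-8") as f:
        f.write(entry)
    return path
```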
Step 6: Update Plan Based on Results
After test or checkpoint:
POST-VALIDATION UPDATE
======================
Original plan: [summary]
Test/Checkpoint: [what was done]
Result: [what happened]
Prediction vs Reality:
| Prediction | Reality | Match? |
|------------|---------|--------|
| [pred 1] | [actual] | [Y/N] |
| [pred 2] | [actual] | [Y/N] |
If predictions matched: [Proceed with higher confidence]
If predictions didn't match:
- What this reveals: [learning]
- Plan revision needed: [changes]
- New uncertainty: [what we still don't know]
- Next test: [if applicable]
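The prediction-vs-reality table lends itself to a simple comparator; a sketch (the function and tuple shape are illustrative):

```python
def compare_predictions(pairs: list[tuple[str, str, bool]]) -> bool:
    """pairs: (prediction, reality, matched) rows.
    Prints the Step 6 table; returns True only if every prediction held."""
    print("| Prediction | Reality | Match? |")
    print("|------------|---------|--------|")
    for pred, actual, matched in pairs:
        print(f"| {pred} | {actual} | {'Y' if matched else 'N'} |")
    return all(matched for _, _, matched in pairs)
```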
Common Patterns
Pattern 1: Market Test
For: Business plans, product ideas
Test: Small-scale trial, landing page, survey
Measure: Conversion, interest, feedback
Pattern 2: Prototype Test
For: Technical plans, designs
Test: Build smallest working version
Measure: Does it function as expected?
Pattern 3: Expert Review
For: Plans in specialized domains
Test: Get a domain expert's opinion
Measure: What objections or blind spots do they see?
Pattern 4: Historical Analog
For: Situations with precedent
Test: Research similar past situations
Measure: What happened? Why?
Pattern 5: Stress Test
For: Plans with assumptions about conditions
Test: What happens if [key assumption] is wrong?
Measure: Does the plan still work?
Integration with GOSM
This step should come AFTER:
- ARAW search (have explored the space)
- Coherence check (plan hangs together internally)
- Dual analysis (have contrarian perspective)
And BEFORE:
- Final commitment
- Resource allocation
- Step execution
GOSM Integration Point:
... → /vbo → /emv → Commit
When to Skip Empirical Validation
- Action is trivially reversible: just do it and learn
- Time pressure is extreme: GOSM-Lite conditions
- No test is possible: you must commit without testing
- Cost of the test exceeds the cost of the action: just act
Even when skipping, LOG the prediction for future calibration.
Calibration Review (Periodic)
Review prediction logs periodically:
CALIBRATION REVIEW
==================
Period: [date range]
Predictions made: [count]
Predictions resolved: [count]
Accuracy by confidence level:
| Confidence | Made | Correct | Accuracy |
|------------|------|---------|----------|
| 90-100% | | | % |
| 70-89% | | | % |
| 50-69% | | | % |
| <50% | | | % |
Calibration assessment:
[ ] Well-calibrated (accuracy matches confidence)
[ ] Overconfident (accuracy < confidence)
[ ] Underconfident (accuracy > confidence)
Adjustment: [what to change in future predictions]
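A sketch of the accuracy-by-confidence computation, assuming each resolved prediction carries a confidence (0-100%) and a correctness flag (names are illustrative):

```python
def calibration_review(resolved: list[tuple[int, bool]]) -> dict:
    """resolved: (confidence_pct, was_correct) per resolved prediction.
    Returns (made, correct, accuracy) per confidence bucket."""
    buckets = {"90-100%": (90, 101), "70-89%": (70, 90),
               "50-69%": (50, 70), "<50%": (0, 50)}
    report = {}
    for label, (lo, hi) in buckets.items():
        hits = [ok for pct, ok in resolved if lo <= pct < hi]
        made, correct = len(hits), sum(hits)
        report[label] = (made, correct, correct / made if made else None)
    return report

# If 90-100% predictions come back correct only 60% of the time,
# that's the "Overconfident" row on the checklist above.
```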
Execute now: apply the steps above to $ARGUMENTS.