Hypothesis Testing

Input: $ARGUMENTS

Interpretations

Before executing, identify which interpretation matches the user’s input:

Interpretation 1 — Test a specific claim: The user has a concrete hypothesis or assertion and wants a rigorous framework to evaluate whether it holds up against evidence. Interpretation 2 — Validate a business assumption: The user has a belief about their market, customers, or strategy (e.g., “I think users want X”) and needs to design a way to confirm or refute it. Interpretation 3 — Investigate a hunch: The user has a theory about why something is happening (e.g., “I think the bug is caused by X” or “I suspect my fatigue is from Y”) and wants structured help working through it.

If ambiguous, ask: “I can help with rigorously testing a formal claim, validating a business assumption, or investigating a personal hunch — which fits?” If clear from context, proceed with the matching interpretation.

Overview

Systematic procedure for formulating testable hypotheses, designing tests, and updating beliefs based on evidence.

Depth Scaling

Default: 2x. Parse depth from $ARGUMENTS if specified (e.g., “/ht 4x [input]”).

Depth	Min Hypotheses	Min Tests per Hypothesis	Min Competing Explanations	Min Falsification Attempts
1x	2	1	1	1
2x	3	2	2	2
4x	5	3	3	3
8x	7	4	5	5
16x	10	6	7	8

These are floors. Go deeper where insight is dense. Compress where it’s not.

Step 0: Context Detection and Variant Selection

Before full analysis, assess context:

Factor	Value	Notes
Time Pressure	URGENT / NEAR / NORMAL
Stakes	HIGH / MED / LOW
Domain Expertise	EXPERT / INTERMEDIATE / NOVICE
Test Cost	CHEAP / EXPENSIVE

Variant Selection

Context	Variant	Steps
URGENT	HT-Lite	1 (clarify claim), 4 (minimal test), 7 (quick conclude)
LOW stakes + CHEAP test	HT-Quick	1, 4, 5, 7
EXPERT + known domain	HT-Check	2-4, 7 (skip basics, focus on test design)
CHEAP test + learning goal	HT-After	1, 4, 7 + document learnings
HIGH stakes + EXPENSIVE test	HT-Full	All 7 steps + replication planning
Default	HT-Standard	All 7 steps

Selected variant: [variant] because [reasoning]

Steps

Step 1: Clarify the claim and scope

Precisely specify what is being claimed:

STATE THE CLAIM CLEARLY
- What exactly is being asserted?
- Remove ambiguity and vagueness
- Define all key terms
Vague: “Exercise is good for you” Clear: “30 minutes of moderate aerobic exercise 3x/week reduces risk of cardiovascular disease”
IDENTIFY THE CLAIM TYPE

Existential claims:
- “X exists” or “There is an X”
- Hard to falsify (can always say “not found yet”)
Universal claims:
- “All X are Y” or “X always causes Y”
- Falsifiable by one counterexample
Statistical claims:
- “X is associated with Y” or “X increases probability of Y”
- Requires statistical evidence
Causal claims:
- “X causes Y”
- Requires controlled comparison
SPECIFY SCOPE CONDITIONS
- Under what conditions does the claim hold?
- What are the boundary conditions?
- What is the domain of application?
IDENTIFY COMPETING CLAIMS
- What alternative explanations exist?
- What would be true if this claim is false?
- Are there multiple competing hypotheses?
ASSESS BACKGROUND PLAUSIBILITY
- How well does this fit with established knowledge?
- What theory supports or contradicts it?
- What is your initial credence before testing?

Step 2: Formulate testable hypotheses

Transform the claim into testable hypotheses:

STATE THE RESEARCH HYPOTHESIS (H1)

Good hypotheses are:
- Specific: Precise enough to test
- Falsifiable: Could be proven wrong
- Grounded: Based on theory or prior evidence
- Predictive: Make specific predictions
Format: “If [condition], then [prediction]”
STATE THE NULL HYPOTHESIS (H0)
- The hypothesis of no effect, no difference, no relationship
- What you would expect if the research hypothesis is false
- Usually: “There is no difference/relationship/effect”
Purpose: Provides a default to test against
STATE ALTERNATIVE HYPOTHESES
- Other explanations for expected results
- Competing theories that make different predictions
- Important for distinguishing between explanations
DERIVE SPECIFIC PREDICTIONS

From each hypothesis, derive:
- Observable predictions: What should we see?
- Quantitative predictions: How much/how large?
- Conditions: Under what circumstances?
More specific predictions = more informative tests
SPECIFY FALSIFICATION CRITERIA
- What evidence would falsify H1?
- What results would support H0?
- Be concrete: “H1 would be falsified if…”
Karl Popper’s criterion: If nothing could prove it wrong, it’s not scientific.

EXAMPLE:

Claim: “Mindfulness meditation reduces anxiety”

H1: Participants completing 8-week mindfulness program will show lower anxiety scores than waitlist control

H0: No difference in anxiety between groups

Alternative: Attention placebo explains any effect

Predictions:

Mindfulness group: 5+ point reduction on GAD-7
Control group: No significant change
Effect persists at 3-month follow-up

Falsification: Would reject H1 if mindfulness group shows no improvement or performs worse than control

Step 3: Assess prior probability

Estimate the prior probability of the hypothesis:

CONSIDER BASE RATES
- How often are similar claims true?
- What is the prior success rate in this field?
- Novel claims in low-reliability fields: lower priors
CONSIDER THEORETICAL SUPPORT
- Is there a plausible mechanism?
- Does it fit with established theory?
- Strong mechanism + good theory = higher prior
CONSIDER PRIOR EVIDENCE
- What previous studies suggest?
- What is the consensus view?
- Strong prior evidence = higher prior
CONSIDER EXTRAORDINARY CLAIMS
- Extraordinary claims require extraordinary evidence
- Claims that contradict well-established facts need very strong evidence to overturn
- ESP, perpetual motion, etc.: very low priors
ASSIGN A PRIOR PROBABILITY

Be explicit:
- Point estimate: “I estimate P(H1) = 30%”
- Range: “I estimate P(H1) between 20-40%”
- Calibration: Check against known frequencies
Priors should be:
- Honest: Reflect genuine uncertainty
- Defensible: Can explain reasoning
- Not extreme: Avoid 0% or 100% (unfalsifiable)
CONSIDER SENSITIVITY TO PRIORS
- How much does conclusion depend on prior?
- Would different reasonable priors change conclusion?
- Report sensitivity analysis

CALIBRATION GUIDELINES:

50%: Coin flip, genuinely uncertain
75%: Think it’s probably true, would bet modest amount
90%: Quite confident, would be surprised if wrong
99%: Very strong belief, extraordinary evidence to change

Note: This is pre-experimental probability, before your test.

Step 4: Design a severe test

Design a test that could actually falsify the hypothesis:

SEVERITY PRINCIPLE

A test is severe if:
- It has a good chance of revealing the hypothesis is false IF it actually is false
- Passing the test provides strong evidence
Weak tests: Would pass whether hypothesis true or false Strong tests: Would fail if hypothesis is false
INCREASE SEVERITY BY:

High-risk predictions:
- Predict specific outcomes, not vague trends
- Predict surprising outcomes (if true)
- Predict precise quantities
Example: Weak: “Treatment will help some people” Severe: “Treatment will produce 10-point improvement in 60% of participants within 4 weeks”

Good methodology:
- Controls for alternative explanations
- Adequate sample size for power
- Reliable and valid measures
- Blinding where possible
Multiple tests:
- Test different predictions from same hypothesis
- Converging evidence from different methods
- Replication across contexts
CHOOSE ALPHA LEVEL AND POWER

Alpha (Type I error rate):
- Conventional: 0.05
- Stricter for extraordinary claims: 0.01 or 0.001
Power (1 - Type II error rate):
- Minimum: 0.80
- Better: 0.90 or higher
- Higher power = more severe test
SPECIFY DECISION RULE

Before seeing results, specify:
- What would count as support for H1?
- What would count as support for H0?
- What would be inconclusive?
Example:
- Support H1: p < .05 with d > 0.3
- Support H0: p > .10 with d < 0.2
- Inconclusive: Otherwise
PRE-REGISTER

Commit to analysis plan before data collection:
- Prevents p-hacking and HARKing
- Distinguishes confirmatory from exploratory
- Increases credibility of findings

Step 5: Evaluate the evidence

Assess the strength of evidence from the test:

CLASSICAL HYPOTHESIS TESTING

P-value interpretation:
- P(data or more extreme | H0 is true)
- Small p: Data unlikely under H0
- Does NOT give probability H1 is true
Standard thresholds (arbitrary conventions):
- p < .05: “Statistically significant”
- p < .01: “Highly significant”
- p < .001: “Very highly significant”
Effect size:
- How large is the effect?
- Cohen’s d, r, odds ratio, etc.
- Practical vs. statistical significance
BAYESIAN UPDATING

Bayes’ theorem: P(H|D) = P(D|H) × P(H) / P(D)

Components:
- P(H): Prior probability of hypothesis
- P(D|H): Likelihood of data given hypothesis
- P(D): Probability of data (normalizing constant)
- P(H|D): Posterior probability after seeing data
Bayes factor:
- BF = P(D|H1) / P(D|H0)
- How much more likely is data under H1 vs H0?
- BF > 3: Substantial evidence for H1
- BF > 10: Strong evidence for H1
- BF > 100: Decisive evidence for H1
ASSESS EVIDENCE QUALITY

Was the test actually severe?
- Did methodology match plan?
- Were there unexpected issues?
- How many researcher degrees of freedom?
Internal validity:
- Are alternative explanations ruled out?
- Were controls adequate?
External validity:
- Does this generalize?
- Are there boundary conditions?
COMPARE TO PREDICTIONS

Exactly as predicted: Strong support Partially as predicted: Moderate support Opposite of predicted: Strong disconfirmation Null result: Depends on power

Note: High-powered null results are informative

Step 6: Update beliefs appropriately

Revise probability estimates based on evidence:

CALCULATE POSTERIOR PROBABILITY

Using Bayes’ theorem:

Posterior odds = Prior odds × Bayes factor

Example:
- Prior: P(H1) = 30% → Prior odds = 30/70 = 0.43
- Bayes factor: 5 (evidence 5x more likely under H1)
- Posterior odds: 0.43 × 5 = 2.14
- Posterior: P(H1) = 2.14/(1+2.14) = 68%
Strong evidence moves beliefs substantially Weak evidence moves beliefs modestly
CONSIDER REPLICATION

Single study provides limited evidence:
- Effect may not replicate
- Publication bias inflates effects
- Wait for replication before high confidence
Rules of thumb:
- One study: Tentative conclusion
- Multiple replications: Stronger confidence
- Failed replications: Reduce confidence
AVOID BELIEF UPDATING ERRORS

Base rate neglect:
- Don’t ignore prior probability
- A positive test doesn’t mean condition is likely if condition is rare
Confirmation bias:
- Update symmetrically for confirming/disconfirming evidence
- Disconfirming evidence should reduce belief
Anchoring:
- Don’t under-update based on strong evidence
- Allow beliefs to change substantially
Motivated reasoning:
- Don’t give favorable evidence more weight
- Apply same standards to all evidence
DOCUMENT BELIEF CHANGE

Record:
- Prior probability
- Evidence summary
- Posterior probability
- Reasoning for update
Be willing to say:
- “I was wrong”
- “The evidence changed my mind”
- “I’m now more/less confident”
IDENTIFY REMAINING UNCERTAINTY

What would further increase/decrease confidence? What additional evidence is needed? What are the key remaining uncertainties?

Step 7: Draw conclusions and decide

Formulate appropriate conclusions and next steps:

STATE THE CONCLUSION

Based on posterior probability:
- Strong support (>90%): “Evidence strongly supports H1”
- Moderate support (70-90%): “Evidence supports H1”
- Uncertain (40-70%): “Evidence is inconclusive”
- Moderate against (10-40%): “Evidence does not support H1”
- Strong against (<10%): “Evidence strongly refutes H1”
State your conclusion at the strength the evidence supports. If the posterior is 65%, say the evidence leans toward support — don’t retreat to ‘inconclusive’ because it feels safer.
DISTINGUISH TYPES OF CONCLUSIONS

Epistemic conclusions (about knowledge):
- “We have evidence that X”
- “X is more/less likely than we thought”
Practical conclusions (about action):
- “We should act as if X”
- “More research is needed before acting”
Evidence can be insufficient for knowledge but sufficient for practical decision
CONSIDER IMPLICATIONS

If hypothesis is supported:
- What does this mean for theory?
- What practical applications follow?
- What should we investigate next?
If hypothesis is refuted:
- What alternative explanations remain?
- What should we conclude instead?
- Was the hypothesis worth testing?
If inconclusive:
- What would resolve uncertainty?
- Is more powerful test possible?
- Should we suspend judgment?
SPECIFY NEXT STEPS

Additional research:
- Replication needed?
- Extension to new conditions?
- Address limitations?
Practical actions:
- What decisions follow?
- What should change based on this evidence?
Theory development:
- How should theory be revised?
- What new hypotheses emerge?
DOCUMENT FOR FUTURE REFERENCE

Create record:
- Hypothesis and predictions
- Test conducted
- Results obtained
- Conclusions drawn
- Next steps identified
Enable cumulative knowledge building

When to Use

Developing research hypotheses from theory or observation
Designing tests of specific claims or predictions
Deciding between competing explanations
Evaluating evidence for or against a claim
Updating probability estimates based on new evidence
Making decisions under uncertainty
Evaluating scientific or pseudoscientific claims

Verification

Context assessed and appropriate variant selected
Hypothesis is specific, testable, and falsifiable
Prior probability is explicit and justified
Test is severe enough to potentially falsify hypothesis
Evidence evaluated using appropriate statistical methods
Belief updating follows from evidence appropriately
Conclusion is appropriately hedged given evidence strength
Predictions logged for future calibration (→ /emv)

Niche Documentation

Where This Skill Works Best

Claims that CAN be tested empirically
Situations where evidence CAN change beliefs
Decisions where being wrong has significant consequences
Scientific or quasi-scientific claims
Situations with time to design and run tests

Where This Skill Struggles

Unfalsifiable claims (use /qaf instead)
Time-critical decisions (use HT-Lite or skip to action)
Low-stakes reversible decisions (just try it)
Pure value judgments (no empirical test possible)
Situations where test cost exceeds action cost

Integration Points

Invokes: /emv (for prediction logging)
Related: /assumption_verification, /exd
Routes to: /dct (if multiple hypotheses), /vbo

ht - Hypothesis Testing