Growth Experiments
Input: $ARGUMENTS
Interpretations
Before executing, identify which interpretation matches the user’s input:
- Interpretation 1 — Design a growth experiment: The user has a specific product or feature change they want to test and needs help structuring a rigorous A/B test or experiment around it.
- Interpretation 2 — Build an experimentation program: The user wants to establish or improve a systematic growth experimentation practice across their team or organization.
- Interpretation 3 — Analyze experiment results: The user has already run a growth experiment and needs help interpreting the data, assessing statistical significance, or deciding next steps.
If ambiguous, ask: “I can help with designing a specific growth experiment, building an experimentation program, or analyzing experiment results — which fits?” If clear from context, proceed with the matching interpretation.
Overview
Run systematic experiments to discover and validate growth levers using hypothesis-driven testing.
Steps
Step 1: Establish experimentation foundation
Set up the infrastructure and culture for experiments:
Define the primary metric (North Star):
- What is the one metric that matters most?
- How does it tie to business value?
- Can it be measured accurately?
Secondary metrics to monitor:
- Leading indicators of primary metric
- Guardrail metrics (things that shouldn’t get worse)
- Segment-level metrics
Statistical requirements:
- Minimum sample size for validity
- Confidence level required (typically 95%)
- Minimum detectable effect (MDE)
- Test duration guidelines
Sample size calculation: the required sample per variant is a function of the baseline conversion rate, the MDE, the confidence level, and statistical power.
Tools: Evan Miller calculator, Optimizely calculator
Example: detecting a 10% relative lift on a 5% baseline conversion rate at 95% confidence and 80% power requires roughly 31,000 users per variant.
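A hedged sketch of this calculation in Python, using the standard two-proportion normal approximation (the function name and defaults are illustrative, not taken from a specific tool):

```python
# Minimal sketch: sample size per variant via the two-proportion normal approximation.
from scipy.stats import norm

def sample_size_per_variant(baseline, relative_mde, alpha=0.05, power=0.80):
    p1 = baseline                        # control conversion rate
    p2 = baseline * (1 + relative_mde)   # variant rate at the minimum detectable effect
    z_alpha = norm.ppf(1 - alpha / 2)    # ~1.96 for 95% confidence
    z_beta = norm.ppf(power)             # ~0.84 for 80% power
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p1 - p2) ** 2

print(round(sample_size_per_variant(0.05, 0.10)))  # ~31,200, consistent with the example above
```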
Experimentation tools:
- A/B testing platforms (Optimizely, VWO, LaunchDarkly)
- Analytics integration (GA4, Amplitude, Mixpanel)
- Statistical analysis (Python/R or built-in tools)
Documentation setup:
- Experiment tracking template
- Results repository
- Learning database
Step 2: Generate experiment hypotheses
Build a list of testable growth ideas:
Hypothesis sources:
Quantitative data:
- Funnel analysis: Where do users drop off?
- Cohort analysis: What do successful users do differently?
- Segment analysis: Which segments perform better?
- Feature usage: What correlates with retention?
Qualitative insights:
- User interviews: What do they struggle with?
- Support tickets: Common complaints/questions
- Session recordings: Where do users get confused?
- Sales conversations: What objections come up?
Competitive intelligence:
- What do competitors do differently?
- What worked for similar businesses?
- Industry best practices
Team ideas:
- Engineering insights on technical improvements
- Sales insights on objection handling
- Support insights on user struggles
- Leadership hypotheses
Hypothesis format: “If we [change], then [metric] will improve by [amount] because [rationale based on evidence].”
Strong hypothesis characteristics:
- Specific and measurable
- Based on data or insights (not just guessing)
- Explains the “why” not just the “what”
- Testable within reasonable timeframe
- Connected to primary metric
Generate 20-50 hypotheses before prioritizing.
Step 3: Prioritize using ICE framework
Rank experiments by expected value:
ICE scoring:
- Impact (1-10): How big will the improvement be?
- Confidence (1-10): How sure are we it will work?
- Ease (1-10): How easy is it to implement?
ICE Score = Impact x Confidence x Ease
Scoring guidelines:
Impact scoring:
- 10: Transforms the metric (50%+ improvement)
- 7-9: Major improvement (20-50%)
- 4-6: Moderate improvement (10-20%)
- 1-3: Small improvement (<10%)
Consider: What % of users does this affect? How much does it change their behavior?
Confidence scoring:
- 10: Proven to work (replicated result)
- 7-9: Strong evidence (similar tests, research)
- 4-6: Reasonable hypothesis (logic + some data)
- 1-3: Educated guess (intuition only)
Consider: What evidence supports this? Has something similar worked before?
Ease scoring:
- 10: Trivial (copy change, hours of work)
- 7-9: Easy (days of work, no dependencies)
- 4-6: Moderate (weeks, some complexity)
- 1-3: Hard (months, cross-team, technical debt)
Consider: Engineering time, design needs, cross-functional dependencies, reversibility
Alternative frameworks:
- RICE: Reach x Impact x Confidence / Effort
- PIE: Potential x Importance x Ease
Rank all hypotheses and select top experiments. Balance quick wins with bigger bets.
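A minimal sketch of scoring and ranking a hypothesis backlog by ICE; the hypothesis names and scores below are invented for illustration:

```python
# Illustrative only: compute ICE = Impact x Confidence x Ease and rank the backlog.
hypotheses = [
    {"name": "Shorten signup form", "impact": 6, "confidence": 7, "ease": 8},
    {"name": "Add social proof to pricing page", "impact": 7, "confidence": 5, "ease": 6},
    {"name": "Rebuild onboarding flow", "impact": 9, "confidence": 4, "ease": 2},
]

for h in hypotheses:
    h["ice"] = h["impact"] * h["confidence"] * h["ease"]

# Highest expected value first
for h in sorted(hypotheses, key=lambda x: x["ice"], reverse=True):
    print(f'{h["ice"]:>4}  {h["name"]}')
```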
Step 4: Design experiments rigorously
Create detailed experiment specifications:
Experiment spec template:
- Hypothesis: “If we [change], then [metric] will improve by [amount] because [rationale].”
- Test design
  - Control: Current experience (describe)
  - Variant(s): Changed experience (describe specifically)
  - Target audience: Who is included/excluded
  - Allocation: % traffic to each variant (see the bucketing sketch after this template)
- Success criteria
  - Primary metric: What defines success
  - Expected lift: Minimum meaningful improvement
  - Secondary metrics: What else to monitor
  - Guardrail metrics: What shouldn’t get worse
- Sample size and duration
  - Required sample per variant
  - Expected duration based on traffic
  - Any timing considerations (seasonality, etc.)
- Implementation details
  - Technical requirements
  - Design assets needed
  - QA checklist
  - Launch steps
- Risk assessment
  - What could go wrong?
  - Reversibility plan
  - Escalation criteria
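For the Allocation line above, one hedged sketch of deterministic, hash-based variant assignment (a common approach, not a specific platform's API; the user ID, experiment name, and 50/50 split are illustrative):

```python
# Illustrative sketch: deterministically assign a user to a variant so the same
# user always sees the same experience. The experiment name acts as a salt.
import hashlib

def assign_variant(user_id: str, experiment: str, control_share: float = 0.5) -> str:
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform value in [0, 1]
    return "control" if bucket < control_share else "variant"

print(assign_variant("user_42", "pricing_page_social_proof"))
```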
Experiment types:
A/B test: Control vs single variant
- Best for: Clear hypothesis, sufficient traffic
- Simpler to analyze
A/B/n test: Multiple variants
- Best for: Testing multiple approaches
- Requires more traffic
Multivariate test: Multiple changes combined
- Best for: Understanding interactions
- Requires much more traffic
Holdback test: New feature vs no feature
- Best for: Measuring feature impact
- Used post-launch to quantify value
Step 5: Run experiments with discipline
Execute experiments following best practices:
Pre-launch checklist:
- Tracking verified (test events firing correctly)
- Variants displaying correctly (QA all paths)
- Success metrics baseline documented
- Team aligned on success criteria
- No conflicting experiments
During experiment:
Don’ts:
- Don’t peek at results too early
- Don’t make changes mid-experiment
- Don’t expand scope while running
- Don’t run conflicting experiments
- Don’t stop early on positive trend
Do’s:
- Monitor for technical issues daily
- Document any anomalies observed
- Check sample balance between variants
- Verify randomization is working
- Monitor guardrail metrics
When to stop early:
- Technical issues affecting user experience
- Clear negative impact on guardrail metrics
- External events invalidating the test
- Sample ratio mismatch indicating problems (see the check sketched below)
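For that last point, a small sketch of a sample ratio mismatch check: a chi-square goodness-of-fit test of observed assignment counts against the planned 50/50 split (the counts are made up):

```python
# Illustrative SRM check: are assignment counts consistent with the planned split?
from scipy.stats import chisquare

observed = [50_420, 49_180]              # users actually assigned to each variant
expected = [sum(observed) / 2] * 2       # planned 50/50 allocation

stat, p_value = chisquare(observed, f_exp=expected)
if p_value < 0.001:                      # a commonly used SRM alert threshold
    print(f"Likely SRM (p = {p_value:.5f}); check randomization and tracking")
else:
    print(f"No SRM detected (p = {p_value:.5f})")
```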
Running multiple experiments:
- Use mutual exclusion groups for conflicting tests
- Document which experiments overlap
- Consider interaction effects
- Limit simultaneous tests to avoid confusion
Experiment velocity:
- Aim for 2-4 experiments per month (starting)
- Build toward 1-2 per week with maturity
- Speed of learning > individual experiment wins
Step 6: Analyze results objectively
Evaluate experiment outcomes rigorously:
Statistical analysis (a code sketch follows this list):
- Calculate confidence interval for difference
- Determine if result is statistically significant
- Check for segment-level effects
- Analyze secondary and guardrail metrics
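A hedged sketch of the first two checks, using a two-proportion z-test and a normal-approximation confidence interval for the difference (the conversion counts are illustrative):

```python
# Illustrative analysis: significance and confidence interval for a lift in
# conversion rate between control (A) and variant (B).
from scipy.stats import norm

def analyze(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the z-test
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = (p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b)) ** 0.5
    p_value = 2 * (1 - norm.cdf(abs(diff / se_pool)))
    # Unpooled standard error for the confidence interval of the difference
    se = (p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b) ** 0.5
    margin = norm.ppf(1 - alpha / 2) * se
    return diff, (diff - margin, diff + margin), p_value

diff, ci, p = analyze(conv_a=1_520, n_a=31_000, conv_b=1_715, n_b=31_000)
print(f"lift: {diff:+.4f}, 95% CI: ({ci[0]:+.4f}, {ci[1]:+.4f}), p = {p:.4f}")
```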
Result categories:
Winner:
- Variant beats control with statistical significance
- Secondary metrics neutral or positive
- Guardrails not violated
Action: Implement winner, document learnings
Loser:
- Control beats variant with statistical significance
- OR guardrails violated
Action: Don’t implement, document why it failed
Inconclusive:
- No statistical significance reached
- Possible causes: not enough traffic, or the true effect is smaller than the MDE
Action: Decide whether to extend, iterate, or abandon
Learning:
- Result (win or lose) teaches something unexpected
- Segment effects reveal new opportunities
Action: Document insights for future experiments
Analysis best practices:
- Don’t cherry-pick favorable segments
- Consider novelty effects (early enthusiasm fades)
- Check for Simpson’s paradox (segment vs aggregate)
- Look at full distribution, not just averages
- Consider practical significance, not just statistical
Common analysis mistakes:
- Stopping early when trend looks good
- Running until you get significance (p-hacking)
- Ignoring secondary metrics that look bad
- Not checking for segment effects
- Attributing causation without controls
Step 7: Extract and document learnings
Capture knowledge from every experiment:
Learning documentation template:
Experiment summary:
- Hypothesis tested
- Result (win/lose/inconclusive)
- Impact on primary metric
- Impact on secondary metrics
Key learnings:
- What did we learn about user behavior?
- What did we learn about our assumptions?
- What surprised us?
- What should we test next?
Implications:
- Should we implement the change?
- What follow-up experiments does this suggest?
- Does this change our strategy?
- Who else should know this?
Learning types:
Tactical learnings:
- “Headlines with numbers get 15% more clicks”
- “Red CTAs outperform green by 8%”
Strategic learnings:
- “Users value speed over features”
- “Price sensitivity varies dramatically by segment”
Process learnings:
- “We need 3 weeks minimum for this type of test”
- “Mobile and desktop show different results”
Build institutional knowledge:
- Maintain searchable experiment repository
- Regular experiment reviews with team
- Share learnings across departments
- Update playbooks with proven tactics
Step 8: Build continuous learning loop
Create a sustainable experimentation system:
Learning loop components:
- Generate hypotheses (ongoing)
  - Regular data review sessions
  - User feedback integration
  - Cross-functional idea sharing
  - Competitive monitoring
- Prioritize ruthlessly
  - Weekly backlog grooming
  - ICE score updates based on learnings
  - Balance quick wins and big bets
  - Deprioritize low-learning experiments
- Run experiments (always have tests running)
  - Pipeline of ready-to-run experiments
  - Clear ownership and accountability
  - Experiment velocity tracking
- Analyze and learn
  - Weekly results review
  - Monthly learning synthesis
  - Quarterly strategy impact assessment
- Apply learnings
  - Implement winners
  - Update playbooks
  - Generate new hypotheses
  - Share across organization
Experimentation metrics:
- Experiment velocity (tests per month)
- Win rate (% of tests that win)
- Impact (cumulative metric improvement)
- Learning rate (insights generated)
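A minimal illustration of computing these program-level metrics from a simple experiment log (the log entries are invented):

```python
# Illustrative only: derive velocity and win rate from a month's experiment log.
log = [
    {"name": "signup_copy", "result": "win"},
    {"name": "pricing_layout", "result": "loss"},
    {"name": "onboarding_email", "result": "inconclusive"},
    {"name": "cta_color", "result": "win"},
]

velocity = len(log)                                         # tests run this month
win_rate = sum(e["result"] == "win" for e in log) / velocity
print(f"velocity: {velocity}/month, win rate: {win_rate:.0%}")
```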
Scaling experimentation:
- Train team members on methodology
- Create self-serve testing capabilities
- Build reusable testing infrastructure
- Celebrate learning, not just winning
Maturity levels:
- Ad hoc: Occasional tests, no system
- Developing: Regular tests, basic process
- Defined: Consistent methodology, tracking
- Optimized: Sophisticated testing, high velocity
- Innovating: Experimentation culture embedded
When to Use
- When you have sufficient traffic for statistically valid tests
- Looking to systematically improve conversion rates
- Want to validate growth ideas before full investment
- Building culture of data-driven decision making
- When intuition differs from data on what works
- Testing new features or changes before broad rollout
- Optimizing marketing channels and messaging
Verification
- Hypotheses are specific, measurable, and evidence-based
- Experiments have sufficient sample size for validity
- Results are analyzed with proper statistical rigor
- Learnings are documented regardless of outcome
- Experiment velocity is tracked and improving
- Winners are implemented, not just declared