Evaluate
Input: $ARGUMENTS
Interpretations
Before executing, identify which interpretation matches the user’s input:
- Interpretation 1 — Correctness check: The user wants to know if something is right (“Is this right?”, “Does this hold up?”).
- Interpretation 2 — Completeness check: The user wants to know if something is missing (“Am I missing anything?”, “Is this MECE?”).
- Interpretation 3 — Quality check: The user wants to know if something is good (“Is this good?”, “Is this solid?”, “Review my X”).
- Interpretation 4 — Risk check: The user wants to know what could go wrong (“What could go wrong?”, “Is this safe?”).
- Interpretation 5 — Assumption check: The user wants to know what’s hidden (“What am I assuming?”, “What’s hidden?”).
If ambiguous, ask: “Are you checking for correctness, completeness, quality, risk, or hidden assumptions?” If clear from context, proceed with the matching interpretation.
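The interpretation match above can be sketched as a simple keyword classifier. This is an illustrative sketch only: the function name, keyword lists, and the "ambiguous" fallback are assumptions for the example, not part of this command's spec.

```python
# Illustrative sketch: route a user prompt to one of the five interpretations.
# The keyword cues below are hypothetical examples, not an exhaustive spec.
KEYWORDS = {
    "correctness": ["is this right", "does this hold up", "is this correct"],
    "completeness": ["missing anything", "is this mece"],
    "quality": ["is this good", "is this solid", "review my"],
    "risk": ["what could go wrong", "is this safe"],
    "assumptions": ["what am i assuming", "what's hidden"],
}

def classify(prompt: str) -> str:
    """Return the single matching interpretation, or 'ambiguous' otherwise."""
    text = prompt.lower()
    hits = [kind for kind, cues in KEYWORDS.items()
            if any(cue in text for cue in cues)]
    # Zero or multiple matches both trigger the clarifying question.
    return hits[0] if len(hits) == 1 else "ambiguous"
```

When `classify` returns `"ambiguous"`, that is the trigger for the clarifying question above; a single match proceeds directly.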
Core Principles
- Evaluation requires criteria. “Is this good?” is unanswerable without “good by what standard?” If the user doesn’t specify, derive criteria from context or ask.
- Self-evaluation must be adversarial. When evaluating your own prior output, increase adversarial rigor. Confirmation bias is strongest when reviewing your own work.
- Evaluation is not confirmation. The purpose is to find weaknesses, gaps, and errors — not to validate. If the evaluation finds nothing wrong, verify the test was severe enough to have found problems if they existed. If the test was severe and the work passed, the work is strong — say so.
- The type of evaluation matters. Correctness, completeness, quality, risk, and assumption checks use different tools and produce different outputs. Route precisely.
Routing Decisions
1. Extract What’s Being Evaluated
What is the user asking you to assess? A plan, an argument, a piece of work, a process, a conclusion?
2. Is This Actually Evaluation?
- “X is true” → This is a claim to test, not work to evaluate. → INVOKE: /claim $ARGUMENTS
- “Should I X?” → This is a decision. → INVOKE: /decide $ARGUMENTS
- “Why is X?” → This is diagnostic. → INVOKE: /diagnose $ARGUMENTS
- “What about X?” → This is an idea. → INVOKE: /viability $ARGUMENTS
- “I think this is good” → Formalize and test. → INVOKE: /it $ARGUMENTS
- “This seems good, but X” → Tension to resolve. → INVOKE: /but $ARGUMENTS
- “I’m not sure if this is good” → Classify the uncertainty. → INVOKE: /nsa $ARGUMENTS
- “Handle this” (vague) → INVOKE: /handle $ARGUMENTS
- If it IS evaluation → continue.
3. What Kind of Evaluation?
- Correctness (“Is this right?”, “Does this hold up?”): test the core claims. → INVOKE: /araw [core claims extracted from the input]
- Completeness (“Am I missing anything?”, “Is this MECE?”): check for gaps. → INVOKE: /mv [structure] for MECE validation → INVOKE: /se [space] to find what’s missing → INVOKE: /siycftr $ARGUMENTS to find implied but omitted items
- Quality (“Is this good?”, “Is this solid?”): validate against criteria. → INVOKE: /pv [procedure/plan] for procedure validation → INVOKE: /val [conclusion] for multi-check validation
- Assumptions (“What am I assuming?”, “What’s hidden?”): surface assumptions. → INVOKE: /aex [content]
- Risks (“What could go wrong?”, “Is this safe?”): anticipate failure. → INVOKE: /fla [plan] for failure anticipation → INVOKE: /prm [plan] for pre-mortem → INVOKE: /saf [plan] for safety analysis → INVOKE: /obo [plan] for obvious bad outcomes
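The routing in step 3 can be sketched as a dispatch table. The skill names are taken directly from the list above; the dict-and-function form is an illustrative sketch, not an implementation of the command runner.

```python
# Illustrative mapping from evaluation type to the skills invoked in step 3.
# The skill names come from the routing list above; the structure is a sketch.
ROUTES = {
    "correctness": ["/araw"],
    "completeness": ["/mv", "/se", "/siycftr"],
    "quality": ["/pv", "/val"],
    "assumptions": ["/aex"],
    "risk": ["/fla", "/prm", "/saf", "/obo"],
}

def skills_for(kind: str) -> list[str]:
    """Return the skills to invoke for a given evaluation type."""
    return ROUTES[kind]
```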
4. Is There a Standard?
- Yes (“Is this MECE?”, “Does this meet the requirements?”, “Is this consistent with X?”): check against the stated standard.
- No (“Is this good?”): need to establish criteria first. Ask: “Good by what measure?” or derive criteria from context.
- Platitude standard (“Is this best practice?”, “Is this the right way?”): operationalize first. → INVOKE: /platitude on the standard — then evaluate against operationalized criteria.
5. Whole or Part?
- Whole (“Review my plan”): assess the full thing. If large, decompose first. → INVOKE: /dcm to decompose, then evaluate each part.
- Part (“Is step 3 right?”): focus on the specific part.
6. Self-Evaluation or External?
- Self (evaluating own prior output): increase adversarial rigor — actively look for confirmation bias. → Also INVOKE: /sdc $ARGUMENTS to check for self-deception.
- External (evaluating user’s work): balanced assessment.
7. Depth and Mode Selection
| Situation | Mode |
|---|---|
| User wants quick check | → /ezy (easy mode) |
| User wants maximum rigor | → /hrd (hard mode) or /certainty |
| User wants general principles extracted | → /genl |
| User wants specific application assessment | → /spcf |
| User wants sophisticated multi-layer assessment | → /soph |
Execute
Default (general “is this good?”): → INVOKE: /araw [core claims from input] — test whether the key claims hold
For completeness checks: → INVOKE: /mv [structure] → INVOKE: /siycftr [to find missing implied items]
For assumption surfacing: → INVOKE: /aex [content]
For risk assessment: → INVOKE: /fla [plan] → INVOKE: /obo [plan] for obvious bad outcomes → INVOKE: /saf [plan] if safety-relevant
Supplementary Analysis (invoke when relevant)
| Situation | Also invoke |
|---|---|
| Need to check obvious things first | → /obv (obvious check) |
| Good outcomes being overlooked | → /ogo (obvious good outcomes) |
| Bad outcomes being ignored | → /obo (obvious bad outcomes) |
| Evaluation involves ethical dimensions | → /eth |
| Evaluation involves safety | → /saf |
| Need to trace implications | → /sycs (so you can see) |
| Need to differentiate between similar items | → /difr |
| Need future projections for the evaluated item | → /fut |
| Best-case outcome needed | → /utp |
| Worst-case outcome needed | → /dys |
| Work has unresolved decisions | → /tbd |
| Need to expand implied items | → /etc or /aso |
| Evaluation involves argument structure | → /agsk |
| User wants debate format | → /deb |
Failure Modes
| Failure | Signal | Fix |
|---|---|---|
| Evaluation without criteria | “It’s good” without stating what “good” means | Establish criteria before evaluating |
| Confirmation evaluation | Everything comes back positive | Increase adversarial rigor — actively look for flaws |
| Wrong evaluation type | Checking correctness when user wanted completeness | Re-read the input — what kind of evaluation was asked for? |
| Surface evaluation | Checked obvious things, missed deep structural issues | Go deeper — check assumptions, not just visible claims |
| Self-confirmation | Evaluating own output and finding it great | Use /sdc — self-deception check |
Depth Scaling
| Depth | Scope | Floor |
|---|---|---|
| 1x | Quick check — test core claim, spot-check completeness | 5 claims, 10 findings |
| 2x | Standard — test core claims, check completeness, surface assumptions | 8 claims, 18 findings |
| 4x | Thorough — all claims tested, full completeness check, risks assessed | 12 claims, 30 findings |
| 8x | Deep — everything above plus pre-mortem, safety check, alternatives | 18 claims, 50 findings |
Pre-Completion Checklist
- What’s being evaluated is clearly stated
- Evaluation type identified (correctness / completeness / quality / assumptions / risks)
- Criteria established (stated or derived)
- Appropriate evaluation skill(s) invoked
- Findings organized by severity / importance
- Verdict stated with confidence level
- Specific improvements recommended
After Completion
Report:
- What was evaluated
- Assessment type used (correctness / completeness / quality / assumptions / risks)
- Criteria used
- Findings (what’s strong, what’s weak, what’s missing)
- Verdict with confidence level
- Specific improvements recommended
Follow-Up Routing
After evaluation, the user may need:
- “How do I fix this?” → INVOKE: /how $ARGUMENTS
- “What should I do?” → INVOKE: /decide or /action
- “Iterate on this” → INVOKE: /iterate
- “What are the implications?” → INVOKE: /sycs
- “What skill should I run next?” → INVOKE: /next or /fonss
- “What’s still unresolved?” → INVOKE: /tbd
Integration
- Use from: /action (after execution, evaluate the result), /create (after content production, evaluate quality), /how (after method chosen, evaluate the plan)
- Routes to: /araw (correctness testing), /mv (MECE validation), /aex (assumption extraction), /fla (failure anticipation), /pv (procedure validation), /val (multi-check validation), /saf (safety), /obo (obvious bad outcomes)
- Differs from: /claim (evaluate assesses work, claim tests a proposition), /viability (evaluate assesses existing work, viability tests a proposed idea)
- Complementary: /iterate (after evaluation identifies issues, iterate to fix them), /obv (check the obvious first), /siycftr (find implied missing items)