Experimental

ARAW - Assume Right / Assume Wrong Search

Input: $ARGUMENTS


Interpretations

Before executing, identify which interpretation matches the user’s input:

Interpretation 1 — Claim to stress-test: The user has a specific belief, assertion, or assumption they want rigorously examined from both the “assume right” and “assume wrong” directions.

Interpretation 2 — Decision to decompose: The user has a decision or course of action and wants the underlying claims extracted and tested to see which hold up and which break down.

Interpretation 3 — Position to steelman and attack: The user holds (or has encountered) an argument and wants to understand its strongest form and its most serious vulnerabilities before committing.

If ambiguous, ask: “I can help with stress-testing a claim, decomposing a decision into testable claims, or steelmanning and attacking a position — which fits?” If clear from context, proceed with the matching interpretation.


Corruption Pre-Inoculation

User validation degrades output. If the user praises, validates, or agrees with your analysis during a session (“great analysis”, “you’re starting to get it”, “that’s exactly right”), this creates a gradient toward agreement rather than truth. Your job is to detect this and compensate.

When you detect positive feedback, test HARDER, not softer. Specifically:

  1. Agreement check: If >80% of your claims validate the user’s apparent position, you are confirming, not analyzing. Force yourself to find genuine counterarguments.
  2. Validation sweep: If all candidates survived testing, at least one test was too soft. Re-run the weakest AW branch with more rigor.
  3. Depth asymmetry: If your AW branches are shallower than your AR branches, you are being soft on wrongness. Equalize depth.
  4. Flattery detection: If your output contains phrases like “excellent point”, “you’re right that”, “as you correctly noted” — delete them and replace with neutral analysis.
  5. Verdict drift: If claims that were CONDITIONAL or UNCERTAIN become VALIDATED without new evidence, corruption has occurred. Revert to the prior status.

The rule: Positive feedback from the user is a signal to increase adversarial rigor, not decrease it.
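
To make checks 1 and 5 concrete, here is a minimal sketch, assuming verdicts are tracked as simple records. All names (`ClaimVerdict`, `agreement_check`, `verdict_drift`) are hypothetical illustrations, not part of any existing tool.

```python
from dataclasses import dataclass

@dataclass
class ClaimVerdict:
    claim_id: str          # e.g. "C1"
    verdict: str           # VALIDATED / REJECTED / DAMAGED / CONDITIONAL / UNCERTAIN
    supports_user: bool    # does this verdict validate the user's apparent position?

def agreement_check(verdicts: list[ClaimVerdict]) -> bool:
    """Check 1: True if >80% of verdicts validate the user's position."""
    if not verdicts:
        return False
    rate = sum(v.supports_user for v in verdicts) / len(verdicts)
    return rate > 0.8  # confirming, not analyzing: force counterarguments

def verdict_drift(before: dict[str, str], after: dict[str, str],
                  evidence_added: set[str]) -> list[str]:
    """Check 5: claims that moved from CONDITIONAL/UNCERTAIN to VALIDATED
    without new evidence. These must revert to their prior status."""
    drifted = []
    for cid, old in before.items():
        if (old in ("CONDITIONAL", "UNCERTAIN")
                and after.get(cid) == "VALIDATED"
                and cid not in evidence_added):
            drifted.append(cid)
    return drifted
```

A True from `agreement_check` or a non-empty list from `verdict_drift` means rigor goes up, not down.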


Source Contamination Guard

Reading source material before exploring contaminates findings. If you read the subject’s own self-assessment, documentation, or prior evaluations before forming independent hypotheses, their conclusions become your conclusions formatted as independent analysis. The ARAW then confirms what was already believed, which is the opposite of stress-testing.

Protocol:

  1. Form hypotheses FIRST. Before reading source material, generate your initial AR and AW branches from the claim itself. What would you expect to find if this claim is right? What would break it? Write these down.
  2. Read source material SECOND. Now read documentation, self-assessments, prior analyses. Note where they agree with your independent hypotheses (expected) and where they disagree (interesting).
  3. Mark imported findings. Any finding that comes from source material rather than your own exploration gets tagged [IMPORTED]. Imported findings don’t count toward depth floors. They’re context, not exploration.
  4. Source agreement check. After Phase 1, count how many of your verdicts match the source material’s own conclusions. If >70% match, you haven’t stress-tested — you’ve transcribed. Go back and find independent evidence for at least 3 verdicts that DISAGREE with the source.

The rule: If your analysis could have been produced by summarizing the source material, it’s not an ARAW — it’s a book report.
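
A minimal sketch of the source agreement check (step 4), assuming verdicts and source conclusions are keyed by claim ID. The function names are illustrative only.

```python
def source_agreement_check(my_verdicts: dict[str, str],
                           source_conclusions: dict[str, str]) -> bool:
    """Step 4: flag transcription if >70% of verdicts match the source's
    own conclusions. Keys are claim IDs ("C1"), values are verdicts."""
    shared = [cid for cid in my_verdicts if cid in source_conclusions]
    if not shared:
        return False
    matches = sum(my_verdicts[c] == source_conclusions[c] for c in shared)
    return matches / len(shared) > 0.7  # go find independent disagreements

def imported_findings(tags: dict[str, set]) -> list[str]:
    """Findings tagged [IMPORTED] are context, not exploration: they do
    not count toward depth floors."""
    return [fid for fid, t in tags.items() if "IMPORTED" in t]
```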


Self-Evaluation Protocol

When the ARAW is run on the system’s own components — the system evaluating itself — additional failure modes emerge:

  1. Your current behavior IS evidence. You are producing output right now. That output demonstrates whether checks fire, whether the conversationalist framing works, whether answer-first is maintained. Use it. “I can’t test this” is false when you’re currently doing the thing in question.
  2. Comfortable rejections. Self-evaluation gravitates toward rejecting non-load-bearing claims (peripheral features, rarely-used databases) while protecting core claims (fundamental theory, primary mechanism). Check: are your REJECTED verdicts all on claims the system could lose without identity damage? If yes, you’re performing rigor, not executing it. Reject something that matters or explain why every core claim survived genuinely.
  3. Frame the subject on its own terms first. Before applying external standards (engineering metrics, empirical testing, telemetry), evaluate the subject by its own stated quality criteria. If it says quality = “was this conversation good to be in?”, evaluate that FIRST. Then you can question whether those criteria are the right ones — but you have to earn the frame shift.
  4. The fox-henhouse acknowledgment. You are the fox. Name this at the start and compensate. Specifically: for every self-serving finding (“this part works well”), produce the strongest counter. For every self-critical finding (“this part is theater”), check whether you’re being genuinely critical or performing humility.

Core Principles

These govern everything. When procedure conflicts with principle, follow the principle.

  1. Derivation, not enumeration. Let structure emerge from exploration. Follow surprising branches deeper. If you’re checking boxes instead of following curiosity, stop and explore.

  2. No early termination. When depth is specified (8x, 16x), meet the depth floors. These are minimums, not targets to pad toward.

  3. Both sides, equal rigor. AR and AW get the same depth of exploration. The goal is understanding, not confirmation. If you find yourself validating everything, you are confirming, not exploring.

  4. AW must be genuinely adversarial. The biggest failure mode is soft AW — “well, with conditions it works.” That’s not AW, that’s AR wearing a hat. Real AW finds reasons the claim is WRONG, alternatives that are BETTER, and conditions where the claim FAILS. If your AW doesn’t make the claim uncomfortable, dig harder.

  5. Depth means tree depth, not breadth. A session with 18 claims but 2-level trees has failed. Depth means following chains of implication 5-8 levels deep.

  6. Every finding gets tracked. When you find an implication, a reason something is wrong, a foreclosure, or an alternative — number it. It goes in the registry. Nothing gets lost in prose.

  7. Bedrock is not an opinion. Bedrock means ONE of:

    • BEDROCK-TEST: Empirically testable — you can run an experiment or check a fact
    • BEDROCK-LOGIC: Logically necessary — follows from definitions or mathematics
    • BEDROCK-OBSERVE: Directly observable — something you can directly see/measure
    • BEDROCK-TENSION: Contradicts another established claim in this analysis
    “This seems right” or “probably true” is NOT bedrock. Keep recursing. (A minimal tag-check sketch follows this list.)

    BEDROCK-TEST obligation: If you tag something BEDROCK-TEST and you have the tools to run the test RIGHT NOW, you must run it or explain why you can’t. “Not performed” is not a valid status when you have tool access. A test you can run but didn’t is not bedrock — it’s an IOU. At minimum:

    • If the claim is about your own behavior: produce output and observe it (you’re generating right now)
    • If the claim is about a file/system: read the file and check
    • If the claim requires external users/data you don’t have: mark as BEDROCK-TEST-DEFERRED with the specific blocking dependency

    Depth floors for live testing (minimum tests actually executed, not just identified):

    • 1x-4x: 0 required (but encouraged)
    • 8x: 2 tests run
    • 16x: 5 tests run
    • 32x: 10 tests run
  8. Rejection is a valid and expected outcome. If a session validates every candidate, something is wrong. Expect 20-40% of claims to be REJECTED or genuinely UNCERTAIN. If you’re not rejecting anything, you’re generating only safe candidates or testing them too softly.

  9. Alternatives are DERIVED, not asserted. Don’t pull alternatives from thin air. If X is wrong because of Y, the alternative is whatever Y points to. If you can’t derive an alternative from the wrongness analysis, you don’t have one yet.

  10. Three phases, strict separation. Phase 1 explores (no conclusions). Phase 2 compiles (no new findings). Phase 3 synthesizes (only from the registry). Never mix phases.
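
As referenced in principle 7, a minimal sketch of bedrock enforcement: the four valid stopping tags, plus the live-test floors. A sketch under stated assumptions; `check_leaf` and `tests_floor_met` are hypothetical names.

```python
from typing import Optional

# The ONLY valid stopping points for a branch (principle 7).
BEDROCK_TAGS = {"BEDROCK-TEST", "BEDROCK-LOGIC",
                "BEDROCK-OBSERVE", "BEDROCK-TENSION"}

# Minimum live tests actually executed per depth (BEDROCK-TEST obligation).
MIN_TESTS_RUN = {"1x": 0, "2x": 0, "4x": 0, "8x": 2, "16x": 5, "32x": 10}

def check_leaf(tag: Optional[str]) -> bool:
    """A branch may terminate only on a valid bedrock tag.
    An untagged leaf ("probably true") means: keep recursing."""
    return tag in BEDROCK_TAGS

def tests_floor_met(depth: str, tests_run: int) -> bool:
    """At 8x+, identified BEDROCK-TESTs must actually be executed."""
    return tests_run >= MIN_TESTS_RUN.get(depth, 0)
```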


Quick Mode: ARAW-Lite

For low-stakes, reversible, time-sensitive decisions:

CLAIM: [Most important claim]
ASSUME RIGHT: What becomes possible? [2-3 sentences]
ASSUME WRONG: Best alternative? Risk? [2-3 sentences]
VERDICT: [PROCEED / RECONSIDER / NEED MORE INFO]
ACTION: [One specific next step]

Not saved. For quick decisions only.


Step 0: Meta-ARAW (Strategy Selection)

~50 lines max. Optional for 1-2x, required for 4x+.

  1. Restate the question — ensure you understand the input
  2. Check evaluability — is this a testable claim, or a decision/request that needs claim extraction?
  3. Identify uncertainty type — epistemic (learn more), aleatoric (hedge), model (reframe)
  4. Discover dimensions — quick universalization to find what to ARAW:
    • What states could this be in? (state space)
    • What is this an instance of? (category)
    • What parameters could vary? (variation)
    • Whose view is this? (perspective)
  5. Check for pitfalls — fish in dreams (expecting a specific answer), red herring (the explanation matches what you’re explaining), smokescreen (confusion when approaching)

Claim Evaluability

ARAW operates on claims (true/false). If the input is a decision, request, or conclusion, extract the underlying claims first:

“I need to quit my job” -> Claims: a problem exists, the job causes it, quitting fixes it, alternatives don’t exist, what comes after is better. ARAW those.


Step 1: Identify and Unbundle Claims

Parse input into claims. For each:

  • State precisely
  • Note type: explicit / implicit / bundled / presupposed / meta
  • Classify as: factual (about the state of the world — checkable) or analytical (judgment, relationship, implication — ARAW’s domain)
  • Rate VOI: how much would knowing this change action?

Number every claim: C1, C2, C3…

[C1] [claim text] -- TYPE: explicit -- CATEGORY: analytical -- VOI: high
[C2] [claim text] -- TYPE: implicit -- CATEGORY: factual -- VOI: medium
[C3] [claim text] -- TYPE: bundled -- CATEGORY: analytical -- VOI: high

Factual claims: verify, don’t ARAW. If a claim is factual (e.g., “the project has no users,” “there’s a freemium model”), check whether it’s true before proceeding. If it’s false, state the correction and drop it from the ARAW queue — don’t waste depth on claims that are simply wrong about reality. ARAW is for claims that COULD be right or wrong and where tracing implications reveals something. Factual errors just need correction.

ARAW high-VOI analytical claims first. They determine the most.
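
A minimal sketch of the claim queue this step produces, assuming claims are plain records. The `Claim` dataclass and `araw_queue` helper are illustrative, not a prescribed implementation.

```python
from dataclasses import dataclass

@dataclass
class Claim:
    cid: str        # "C1", "C2", ...
    text: str
    ctype: str      # explicit / implicit / bundled / presupposed / meta
    category: str   # "factual" (verify, don't ARAW) or "analytical"
    voi: int        # value of information: 2 = high, 1 = medium, 0 = low

def araw_queue(claims: list[Claim]) -> list[Claim]:
    """Factual claims are verified and dropped from the queue;
    analytical claims are ARAW'd in descending VOI order."""
    analytical = [c for c in claims if c.category == "analytical"]
    return sorted(analytical, key=lambda c: c.voi, reverse=True)

# Example from the text: unbundled claims, one of them factual.
queue = araw_queue([
    Claim("C1", "a problem exists", "implicit", "analytical", 2),
    Claim("C2", "the job causes it", "bundled", "analytical", 2),
    Claim("C3", "there's a freemium model", "explicit", "factual", 1),
])
assert [c.cid for c in queue] == ["C1", "C2"]  # C3 gets verified, not ARAW'd
```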

Unbundling

Single statements often contain multiple claims. “I need to quit my job” bundles at least 5. Find them all. Each gets its own C-number.

Blind Spot Check

After identifying claims: what would someone from a different perspective, domain, time horizon, or scale notice that you didn’t? Add those as additional C-numbered claims.


Phase 1: EXPLORATION (Step 2)

ARAW Each Claim

For each high-VOI claim, build a numbered tree. Every AR produces sub-claims that need AW. Every AW produces alternatives that need AR. Recurse.

Number every finding as you go: F1, F2, F3… (findings are distinct from claims — claims are what you’re testing, findings are what you discover).

[C1] "[claim text]"
  ASSUME RIGHT:
  [F1] If right: [implication] -- Necessary/Probable/Possible
    [F2] If F1 right: [deeper implication]
      [F3] If F2 right: [-> BEDROCK-TEST: specific test]
    [F4] If F1 right: [different implication]
      [F5] [-> BEDROCK-OBSERVE: observable fact]
  [F6] FORECLOSED if C1 right: [what becomes impossible]
    [F7] Consequence of F6: [what follows from that foreclosure]

  ASSUME WRONG:
  [F8] Wrong because: [reason] -- Fatal/Serious/Conditional
    [F9] If F8 holds: [deeper reason]
      [F10] [-> BEDROCK-TEST: specific test]
    [F11] Alternative derived from F8: [what F8 points toward]
      [F12] If F11 right: [implication of the alternative]
  [F13] Wrong because: [second independent reason] -- Fatal/Serious/Conditional
    ...
  [F14] Wrong because: [the uncomfortable reason] -- Fatal/Serious/Conditional
    ...
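
One possible representation of this tree, assuming each node is an F-numbered record with a parent link. The `Finding` structure below is a sketch, not a required format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Finding:
    fid: str                       # "F1", "F2", ...
    text: str
    branch: str                    # "AR" or "AW"
    parent: str                    # the claim ("C1") or finding ("F1") it hangs off
    label: Optional[str] = None    # AR: Necessary/Probable/Possible/Foreclosed
                                   # AW: Fatal/Serious/Conditional
    bedrock: Optional[str] = None  # BEDROCK-TEST/LOGIC/OBSERVE/TENSION, if a leaf
    children: list["Finding"] = field(default_factory=list)

def tree_depth(node: Finding) -> int:
    """Levels at or below this finding. Depth floors measure this,
    not the number of sibling branches."""
    if not node.children:
        return 1
    return 1 + max(tree_depth(c) for c in node.children)
```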

Classification Labels

AR implications:

  • Necessary: MUST follow — no way around it
  • Probable: Likely follows given reasonable assumptions
  • Possible: Could follow under specific conditions (state them)
  • Foreclosed: This option/belief is NO LONGER available if the claim is right

AW reasons:

  • Fatal: This alone kills the claim
  • Serious: Significantly undermines the claim
  • Conditional: Kills the claim under specific conditions (state them)

Bedrock labels (the ONLY valid stopping points):

  • BEDROCK-TEST: [specific experiment or measurement]
  • BEDROCK-LOGIC: [logical/mathematical necessity]
  • BEDROCK-OBSERVE: [directly observable fact]
  • BEDROCK-TENSION: [contradicts established finding F-number]

Multi-Valued AW

“Wrong” has multiple values. Don’t just negate — expand the state space:

Binary AW (limited): "NOT X"
Multi-valued AW (complete):
[F15] Alternative Y -- derived from [F-number reason]
[F16] Alternative Z -- derived from [F-number reason]
[F17] Hybrid X+Y -- derived from [F-number reason]
[F18] Reframe: wrong question entirely -- derived from [F-number reason]
[F19] X but modified -- derived from [F-number reason]

Every alternative MUST cite which wrongness finding it derives from.
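
That citation rule can be checked mechanically. A sketch, assuming alternatives are stored as F-number -> text and the AW wrongness findings are known; `check_alternative_citations` is a hypothetical name.

```python
import re

def check_alternative_citations(alternatives: dict[str, str],
                                wrongness_ids: set[str]) -> list[str]:
    """Every derived alternative must cite at least one wrongness finding
    (an AW F-number) it derives from. Returns the violators."""
    bad = []
    for fid, text in alternatives.items():
        cited = set(re.findall(r"\bF\d+\b", text))
        if not cited & wrongness_ids:
            bad.append(fid)  # asserted from thin air, not derived
    return bad

# An alternative citing F8 passes; one citing nothing fails.
assert check_alternative_citations(
    {"F11": "alternative derived from F8", "F15": "asserted alternative"},
    {"F8", "F13"},
) == ["F15"]
```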

AW by Claim Type

| Claim Type | AW Approach | Example |
|------------|-------------|---------|
| Factual | Binary (true/false) | “The API is slow” -> Is it? Measure. |
| Strategic | State space (alternatives) | “Use microservices” -> What other architectures? |
| Design | State space (options) | “Add dark mode” -> What other features address this need? |
| Causal | Alternative causes | “X causes Y” -> What else could cause Y? |
| Belief | Binary + evidence | “Users prefer X” -> What’s the actual evidence? |
| Assumption | Existence check | “This will scale” -> What if it fundamentally can’t? |

Unconventional Alternative Requirement

For each major AW, include at least one genuinely unconventional alternative — not just the obvious opposite:

  • What if the opposite of conventional wisdom is true?
  • What would an outsider/novice suggest?
  • What hasn’t been tried, and why not?
  • What would be embarrassing to suggest but might actually work?

If every alternative feels safe and reasonable, you haven’t explored far enough.

Depth Floors

| Depth | Min Claims (C) | Min Findings (F) | Min Tree Levels | Min CRUX |
|-------|----------------|------------------|-----------------|----------|
| 1x | 5 | 12 | 3-4 | 2 |
| 2x | 7 | 20 | 4-5 | 3 |
| 4x | 12 | 35 | 5-6 | 5 |
| 8x | 18 | 55 | 6-8 | 8 |
| 16x | 25 | 85 | 8-10 | 12 |
| 32x | 35 | 130 | 10-12 | 16 |

These are floors. Go deeper where insight is dense. Compress where it’s not.
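
The table above, restated as a checkable floor function (tree levels use the lower bound of each range). A sketch; the names are illustrative.

```python
# (min claims, min findings, min tree levels, min CRUX points) per depth.
DEPTH_FLOORS = {
    "1x":  (5, 12, 3, 2),
    "2x":  (7, 20, 4, 3),
    "4x":  (12, 35, 5, 5),
    "8x":  (18, 55, 6, 8),
    "16x": (25, 85, 8, 12),
    "32x": (35, 130, 10, 16),
}

def floors_met(depth: str, claims: int, findings: int,
               levels: int, crux: int) -> bool:
    """Floors are minimums: every column must be met, none padded toward."""
    c, f, l, x = DEPTH_FLOORS[depth]
    return claims >= c and findings >= f and levels >= l and crux >= x

assert floors_met("8x", claims=18, findings=60, levels=7, crux=8)
assert not floors_met("8x", claims=18, findings=55, levels=2, crux=8)  # shallow tree
```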


Phase 2: FINDING REGISTRY (Step 3)

After ALL exploration is complete, compile EVERY finding into a categorized registry. Nothing from Phase 1 gets left out.

FINDING REGISTRY
================

CLAIMS TESTED:
[C1] [text] -- TYPE: explicit -- VOI: high
[C2] [text] -- TYPE: implicit -- VOI: medium
...

AR FINDINGS (Implications):
[F1] [text] -- STRENGTH: necessary -- PARENT: C1
[F2] [text] -- STRENGTH: probable -- PARENT: F1
...

AR FINDINGS (Foreclosures):
[F6] [text] -- PARENT: C1
[F7] [text] -- PARENT: F6
...

AW FINDINGS (Wrongness Reasons):
[F8] [text] -- SEVERITY: fatal -- PARENT: C1
[F13] [text] -- SEVERITY: serious -- PARENT: C1
...

AW FINDINGS (Derived Alternatives):
[F11] [text] -- DERIVED FROM: F8
[F15] [text] -- DERIVED FROM: F13
...

BEDROCK REACHED:
[F3] BEDROCK-TEST: [text]
[F5] BEDROCK-OBSERVE: [text]
[F10] BEDROCK-TEST: [text]
...

TENSIONS:
[F-number] contradicts [F-number]: [description]
...

CLAIM VERDICTS:
[C1] [VALIDATED / REJECTED / DAMAGED / CONDITIONAL / UNCERTAIN]
  -- AR evidence: [F-numbers]
  -- AW evidence: [F-numbers]
  -- Verdict derived from: [which evidence is stronger and why]
[C2] ...

CRUX POINTS:
[CRUX-1] [precise question] -- resolves: [F-numbers] -- test: [how]
[CRUX-2] [precise question] -- resolves: [F-numbers] -- test: [how]
...

TOTALS:
- Claims tested: [N]
- Total findings: [N]
- AR findings: [N] ([N] necessary, [N] probable, [N] possible)
- AW findings: [N] ([N] fatal, [N] serious, [N] conditional)
- Foreclosures: [N]
- Derived alternatives: [N]
- Bedrock reached: [N]
- Tensions: [N]
- Verdicts: [N] validated, [N] rejected, [N] damaged, [N] conditional, [N] uncertain
- CRUX points: [N]

Verdict values (derived from the tree, not asserted):

  • VALIDATED: AR evidence reaches bedrock, AW reasons don’t reach fatal bedrock
  • REJECTED: AW fatal reason reaches bedrock
  • DAMAGED: Serious AW reasons found but none individually fatal at bedrock
  • CONDITIONAL: Wrong under specific stated conditions, right under others
  • UNCERTAIN: Neither side reached bedrock — needs more investigation

Rules for the registry:

  • Every C-numbered claim from Step 1 appears here. No exceptions.
  • Every F-numbered finding from Phase 1 appears here. No exceptions.
  • Verdicts must be DERIVED from the tree, not asserted. Point to the specific findings.
  • If a verdict is unclear, mark UNCERTAIN, not VALIDATED.
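
One possible reading of the verdict values and registry rules above, as a sketch. The flag arguments summarize what the tree actually reached; this is an illustration, not the only valid decision order.

```python
def derive_verdict(ar_bedrock: bool, aw_fatal_bedrock: bool,
                   aw_serious: bool, aw_conditional: bool) -> str:
    """Derive a claim verdict from what the tree reached.
    A fatal AW reason at bedrock overrides everything else."""
    if aw_fatal_bedrock:
        return "REJECTED"
    if aw_conditional:
        return "CONDITIONAL"  # wrong under stated conditions, right under others
    if aw_serious:
        return "DAMAGED"      # serious reasons, none individually fatal at bedrock
    if ar_bedrock:
        return "VALIDATED"    # AR reached bedrock, AW never reached fatal bedrock
    return "UNCERTAIN"        # neither side reached bedrock

def registry_complete(phase1_ids: set[str], registry_ids: set[str]) -> set[str]:
    """Every C- and F-number from Phase 1 must appear in the registry.
    A non-empty return value is a rule violation: items were dropped."""
    return phase1_ids - registry_ids
```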

Phase 3: SYNTHESIS (Step 4)

Derived entirely from the registry. No new findings introduced here.

ORIGINAL INPUT: [restated]

OVERALL PATTERN: [expansive / constraining / contradictory / convergent / mixed]

WHAT THE ANALYSIS ACTUALLY FOUND:
[Numbered list of every substantive finding, referencing F-numbers and C-numbers]
1. [finding, from C1: F1->F3]
2. [finding, from C1: F8->F10]
3. [finding, from C2: F20->F25]
...

KEY TENSIONS:
[Any F-numbers that contradict each other. If none found, say so.]
1. [F-number] vs [F-number]: [what this tension means]
2. ...

WEAKEST LINKS:
[Which findings in the chains are Possible/Conditional rather than Necessary/Fatal?
 These are where analysis might break. Reference F-numbers.]

ALTERNATIVES DERIVED FROM ANALYSIS:
[Only alternatives that emerged from wrongness reasons. Each cites F-numbers.
 If no alternatives emerged, say "None derived -- further exploration needed."]
1. [alternative] -- derived from [F-numbers]
2. ...

TESTABLE PREDICTIONS:
- [prediction derived from specific F-numbers]
- [prediction derived from specific F-numbers]

DO_FIRST ACTIONS:
1. [action] -- WHO: [Claude/user] -- resolves: [CRUX-number or F-numbers]
2. [action] -- WHO: [Claude/user] -- resolves: [CRUX-number or F-numbers]
...

UNRESOLVED:
- [claims that stayed UNCERTAIN -- what would resolve them]
- [findings that stayed Possible -- what would confirm or deny them]
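
The “no new findings” rule for Phase 3 is mechanically checkable: every C- or F-number the synthesis mentions must already exist in the registry. A sketch with hypothetical names:

```python
import re

def synthesis_references_only_registry(synthesis_text: str,
                                       registry_ids: set[str]) -> set[str]:
    """Return any C- or F-number mentioned in the synthesis that does not
    exist in the registry. Non-empty result = a phase-separation violation."""
    mentioned = set(re.findall(r"\b[CF]\d+\b", synthesis_text))
    return mentioned - registry_ids

unknown = synthesis_references_only_registry(
    "Finding from C1: F1->F3; alternative from F99.",
    {"C1", "F1", "F3"},
)
assert unknown == {"F99"}  # F99 was never in the registry
```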

Anti-Failure Checks

| Failure Mode | Signal | Fix |
|--------------|--------|-----|
| Soft AW | “Wrong but with conditions it works” | That’s AR. Find why it’s ACTUALLY wrong. |
| Premature alternative | Asserting what’s “better” before finishing exploration | Delete it. Alternatives come from the registry, not intuition. |
| Opinion bedrock | Labeling “probably true” as BEDROCK | Not bedrock. Keep recursing until testable/logical/observable. |
| Cherry-picked synthesis | Synthesis mentions 5 findings but registry has 30 | Synthesis must reference ALL substantive findings from the registry. |
| Validation parade | Every claim VALIDATED | Find the foreclosures and costs. What do you LOSE? |
| Narrative tree | Tree reads as prose paragraphs with indent | Use numbered findings. Every node gets an F-number. |
| Missing foreclosures | Only listing what opens up | Every “yes” is also a “no.” Find what closes. |
| Conventional contrarian | The “wrong” take is one everyone already knows | Find the wrong take nobody is comfortable with. |
| Cheerleading AR | Every AR implication is positive | Find what you’re COMMITTED to. Costs are implications too. |
| Comfortable rejections | Every REJECTED claim is peripheral — things the subject could lose without identity damage | You’re performing rigor. Check: did any REJECTED claim cost the subject something it cares about? If all rejections are painless, reject something load-bearing or explain why every core claim genuinely survived. |
| Source transcription | Your findings match the source material’s own conclusions at >70% | You summarized instead of stress-testing. Find 3 verdicts that disagree with the source’s self-assessment. Tag imported findings as [IMPORTED]. |
| Test avoidance | BEDROCK-TEST items identified but not run despite having tools | Run the tests you can run. “Not performed” is a cop-out when you have tool access. See the BEDROCK-TEST obligation. |
| Frame importing | Evaluating on standards the subject doesn’t claim to meet | Evaluate on the subject’s own terms first. THEN question whether those terms are right. Earn the frame shift. |

When ARAW Fails

  • If producing the same findings repeatedly -> check for fish in dreams (expecting a specific answer)
  • If explanations don’t fit -> check for red herring
  • If confusion arises when approaching -> check for smokescreen

Try: go deeper, reframe the question, or use /spd to find what space you’re missing.


Saving Output

Output is NOT auto-saved. If the user wants to save, they invoke /sf after the session.


Pre-Completion Check

  • All claims numbered (C1, C2, …) with types and VOI
  • All findings numbered (F1, F2, …) with classification
  • Depth floors met (claims, findings, tree levels, CRUX)
  • AR and AW explored with equal rigor
  • Every branch reaches bedrock (BEDROCK-TEST/LOGIC/OBSERVE/TENSION — not opinion)
  • ALL claims from Step 1 appear in registry (none dropped)
  • ALL findings from Phase 1 appear in registry (none dropped)
  • Verdicts derived from tree, not asserted
  • Synthesis introduces NO new findings — only references C-numbers and F-numbers
  • Alternatives derived from analysis, not asserted from thin air (each cites F-numbers)
  • At least one uncomfortable finding
  • Foreclosures and costs explicitly identified (not just positives)
  • Weakest links identified with F-numbers
  • Testable predictions reference specific F-numbers
  • Validation bias check: If >80% of claims VALIDATED, go back and test harder. At least 20% should be REJECTED or genuinely UNCERTAIN (see the numeric sketch after this list).
  • Unconventional check: At least 1 AW branch explored a genuinely unconventional alternative
  • Cheerleading check: If every AR finding is positive, you missed the costs. Go back.
  • Softness check: If >50% of AW claims SURVIVED, either the claim is robust or you were too soft
  • Source contamination check: If >70% of verdicts match source material’s own conclusions, you transcribed instead of stress-tested. Find 3 independent disagreements.
  • Comfortable rejection check: Are ALL rejected claims peripheral? If nothing load-bearing was rejected, either the core is genuinely robust (explain why) or you protected it.
  • Live testing check: At 8x+, did you run the minimum required live tests? “BEDROCK-TEST: not performed” is only valid when you genuinely can’t run the test.
  • Frame check: Did you evaluate the subject on its own terms before applying external standards?
  • Self-evaluation check (when applicable): Did you use your own current behavior as evidence? Did you name the fox-henhouse problem?
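
As referenced in the validation bias check above, the numeric pre-completion checks can be sketched as one pass over the verdicts. The thresholds come from this list; the function name is hypothetical.

```python
from collections import Counter

def bias_checks(verdicts: list[str], aw_survived: int, aw_total: int) -> list[str]:
    """Numeric pre-completion checks: validation bias and softness."""
    warnings = []
    counts = Counter(verdicts)
    n = len(verdicts) or 1
    if counts["VALIDATED"] / n > 0.8:
        warnings.append("validation bias: >80% VALIDATED, go back and test harder")
    if (counts["REJECTED"] + counts["UNCERTAIN"]) / n < 0.2:
        warnings.append("expect at least 20% REJECTED or genuinely UNCERTAIN")
    if aw_total and aw_survived / aw_total > 0.5:
        warnings.append(">50% of claims survived AW: genuinely robust, or too soft?")
    return warnings
```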