Automated Extraction Pipeline
Overview
An industrial-scale automation pipeline that extracts procedures from 70+ YouTube channels and other sources, using the Ralph Wiggum / Conductor pattern for iterative LLM-driven extraction.
Steps
Step 1: Acquire transcripts
For each source in the queue, get clean text:
YOUTUBE VIDEOS:
- Try YouTube Transcript API (manual captions preferred)
- Fall back to Whisper transcription if needed
- Store with timestamps for source location
PDF PAPERS:
- Extract text with PyMuPDF or pdfplumber
- Fall back to OCR for scanned documents
- Preserve page numbers
PODCASTS/AUDIO:
- Download audio
- Transcribe with Whisper (medium model recommended)
- Store with timestamps
TOOL DOCUMENTATION:
- Scrape with requests + BeautifulSoup
- Convert HTML to Markdown
- Preserve structure
Output standardized format:
- source_type, source_id, source_metadata
- full_text, segments (with timestamps/pages)
- quality assessment (manual | auto | whisper | ocr)
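The standardized record can be sketched as a typed Python structure; the field names follow the list above, while the TypedDict shape and the example values are illustrative assumptions:

```python
from typing import Literal, TypedDict

class Segment(TypedDict):
    text: str
    location: str  # timestamp ("00:42") or page ("p. 7")

class TranscriptRecord(TypedDict):
    source_type: Literal["youtube", "pdf", "podcast", "tool_doc"]
    source_id: str
    source_metadata: dict
    full_text: str
    segments: list[Segment]
    quality: Literal["manual", "auto", "whisper", "ocr"]

# Example record for a hypothetical YouTube transcript
record: TranscriptRecord = {
    "source_type": "youtube",
    "source_id": "abc123",
    "source_metadata": {"channel": "ExampleChannel"},
    "full_text": "Here's how to ...",
    "segments": [{"text": "Here's how to ...", "location": "00:42"}],
    "quality": "manual",
}
```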
Step 2: Run extraction loop (Pass 1 - Explicit)
For each transcript, run iterative explicit extraction:
PROMPT TEMPLATE: “Find ALL explicitly stated procedures. Look for: ‘Here’s how to…’, ‘The steps are…’, numbered lists, direct explanations of HOW to do something. Output in YAML format with name, type, confidence, steps, gaps.”
ITERATION LOGIC (Ralph Wiggum Pattern):
- Call LLM with transcript + prompt
- Parse YAML procedures from response
- Check for “EXTRACTION_STATUS: COMPLETE” signal
- If not complete, loop with accumulated extractions as context
- Stop when: complete signal, max 3 iterations, or diminishing returns
Mark all procedures with:
- type: explicit
- confidence: HIGH (verbatim quotes support it)
- source_location: timestamp or page
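The iteration logic above can be sketched as a small loop. Here `call_llm` is a hypothetical stand-in for whatever LLM client is in use, and the response shape (a dict with `procedures` and `status`) is an assumption:

```python
MAX_ITERATIONS = 3

def extract_iteratively(transcript, prompt, call_llm, min_new=1):
    """Loop the LLM, feeding accumulated extractions back as context."""
    extractions = []
    for _ in range(MAX_ITERATIONS):
        seen = {e["name"] for e in extractions}
        response = call_llm(prompt=prompt, transcript=transcript,
                            already_found=sorted(seen))
        # Keep only procedures we have not already accumulated
        new = [p for p in response["procedures"] if p["name"] not in seen]
        extractions.extend(new)
        if response.get("status") == "EXTRACTION_STATUS: COMPLETE":
            break
        if len(new) < min_new:  # diminishing returns
            break
    return extractions
```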
Step 3: Run extraction loop (Pass 2 - Implicit)
Extract procedures hidden in behavior and examples:
PROMPT TEMPLATE: “Find procedures NOT explicitly stated but inferred from: pattern matching (every time X, they do Y), consistent behaviors, unstated steps, decision criteria, and error handling. Output with observed_pattern, evidence, reconstructed_steps.”
ITERATION LOGIC:
- Include explicit extractions as context
- Focus on what they DO vs what they SAY
- Max 3 iterations
- Require evidence from source for each inference
Mark all procedures with:
- type: implicit
- confidence: MEDIUM
- uncertainty: what remains unverified
- validation_needed: how to verify
Step 4: Run extraction loop (Pass 3 - Meta)
Extract HOW they think, learn, teach, and improve:
PROMPT TEMPLATE: “Find META-PROCEDURES about: Learning (how they acquire knowledge), Teaching (how they explain things), Thinking (how they reason and decide), Improvement (how they get better). These are procedures about procedures.”
FOCUS AREAS:
- How do they introduce complex topics?
- What examples do they choose and why?
- How do they handle potential objections?
- What mental models do they use?
Mark all procedures with:
- type: meta
- category: learning | teaching | thinking | improvement
- why_valuable: how this can be applied elsewhere
Step 5: Run extraction loop (Pass 4 - Tacit)
Surface knowledge they have but don’t state:
PROMPT TEMPLATE: “Excavate TACIT KNOWLEDGE through: Assumption surfacing (what must be true?), Expert blind spot detection (what do they skip?), Failure mode inference (what could go wrong?), and Context dependency mapping (when would this NOT work?)”
TECHNIQUES:
- “What would a beginner miss?”
- “What do they ‘just know’?”
- “What warnings would an expert give?”
Mark all extractions with:
- type: tacit
- confidence: LOW (needs validation)
- what_is_assumed: the unstated knowledge
- if_missing_consequence: what goes wrong without it
Step 6: Validate extractions
Run automated validation on all extractions:
STRUCTURAL CHECKS:
- YAML parses correctly
- Required fields present (name, type, steps)
- Steps are non-empty
- No placeholder text ([brackets], TODO)
SEMANTIC CHECKS:
- Name is descriptive (3+ words)
- Steps start with verbs (actionable)
- Source citations are valid
CONSISTENCY CHECKS:
- No exact duplicates within source
- HIGH confidence has verbatim quote
- Type matches extraction characteristics
CALCULATE CONFIDENCE:
- explicit_quote: present/partial/absent
- multiple_evidence: 3+/2/1/inference
- step_completeness: all clear/some gaps/major gaps
- source_clarity: directly stated/implied/inferred/speculation
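One way to combine the four signals above into a confidence label. The numeric weights and thresholds here are illustrative assumptions, not part of the spec:

```python
# Score each signal value; higher means stronger evidence (assumed weights)
SIGNAL_SCORES = {
    "explicit_quote": {"present": 2, "partial": 1, "absent": 0},
    "multiple_evidence": {"3+": 2, "2": 1, "1": 0, "inference": 0},
    "step_completeness": {"all clear": 2, "some gaps": 1, "major gaps": 0},
    "source_clarity": {"directly stated": 2, "implied": 1,
                       "inferred": 1, "speculation": 0},
}

def confidence_label(signals: dict) -> str:
    """Map the four signals to HIGH / MEDIUM / LOW (thresholds assumed)."""
    total = sum(SIGNAL_SCORES[k][v] for k, v in signals.items())
    if total >= 7:
        return "HIGH"
    if total >= 4:
        return "MEDIUM"
    return "LOW"
```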
Flag for human review if:
- Average confidence < MEDIUM
- More than 50% of extractions are tacit type
- More than 3 validation issues
- Contradictory procedures found
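The structural and semantic checks can be sketched as a single validator. Field names mirror the checklist above; the exact rules are assumptions:

```python
import re

REQUIRED_FIELDS = ("name", "type", "steps")
PLACEHOLDER = re.compile(r"\[[^\]]*\]|TODO")  # [brackets] or TODO markers

def validate_structure(proc: dict) -> list[str]:
    """Return a list of validation issues; an empty list means pass."""
    issues = []
    for field in REQUIRED_FIELDS:
        if not proc.get(field):
            issues.append(f"missing or empty field: {field}")
    if len(str(proc.get("name", "")).split()) < 3:
        issues.append("name should be descriptive (3+ words)")
    for step in proc.get("steps") or []:
        if PLACEHOLDER.search(str(step)):
            issues.append(f"placeholder text in step: {step!r}")
    return issues
```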
Step 7: Deduplicate across sources
Prevent duplicate procedures across sources:
DETECTION METHODS:
- Name similarity (fuzzy match, threshold 0.85)
- Step similarity (longest common subsequence, threshold 0.80)
- Embedding similarity (cosine distance, threshold 0.90)
RESOLUTION STRATEGIES:
- Exact duplicate: Keep first, discard duplicate
- Cross-source duplicate: Keep both, link as variants
- Similar but different: Keep both, create procedure family
- Enhanced version: Merge, keeping best parts
Document all deduplication decisions.
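Name similarity can be sketched with stdlib difflib (threshold 0.85, per the spec); step and embedding similarity would follow the same comparison-plus-threshold shape:

```python
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.85  # fuzzy-match threshold from the spec

def name_similarity(a: str, b: str) -> float:
    """Case-insensitive fuzzy ratio between two procedure names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_name_duplicate(a: str, b: str) -> bool:
    return name_similarity(a, b) >= NAME_THRESHOLD
```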
Step 8: Create procedure files
Write GOSM-compatible YAML files:
PATH DETERMINATION:
- YouTube: library/procedures/extracted/youtube/{channel}/{procedure}.yaml
- Papers: library/procedures/extracted/papers/{author}_{year}/{procedure}.yaml
- Books: library/procedures/extracted/books/{author}/{procedure}.yaml
- Tools: library/procedures/extracted/tools/{tool}/{procedure}.yaml
FILE CONTENT:
- id, name, version, domain, description
- source (origin, type, creator, url, location, extraction_date, confidence)
- when_to_use, when_not_to_use
- inputs, outputs
- steps (with action, details, reasoning)
- verification, failure_modes
- notes (extraction metadata, gaps, confidence)
Validate YAML syntax before writing.
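A hedged sketch of writing one YouTube-sourced procedure file, with a PyYAML round-trip check before touching disk. The `slugify` helper and the `base` default are illustrative assumptions:

```python
import re
from pathlib import Path

import yaml  # PyYAML

def slugify(name: str) -> str:
    """Turn a procedure name into a safe filename stem (assumed convention)."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

def write_procedure(proc: dict, channel: str,
                    base: str = "library/procedures/extracted/youtube") -> Path:
    """Serialize, re-parse to validate syntax, then write the file."""
    text = yaml.safe_dump(proc, sort_keys=False, allow_unicode=True)
    yaml.safe_load(text)  # raises if the emitted YAML is malformed
    path = Path(base) / channel / f"{slugify(proc['name'])}.yaml"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```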
Step 9: Update indexes
Keep library indexes current:
INDEXES TO UPDATE:
- library/procedures/extracted/EXTRACTED_PROCEDURES_INDEX.md
  - Total procedures by source type
  - Recent additions
  - Domain groupings
- library/procedures/extracted/EXTRACTION_LOG.md
  - Date, source, creator
  - Procedures extracted by type
  - Time, tokens, cost
- Source-specific indexes
  - library/procedures/extracted/youtube/{channel}/INDEX.md
  - Videos processed, procedures by video
Include in log:
- Source metadata
- Extraction statistics
- Files created
- Cost breakdown
Step 10: Generate extraction report
Create comprehensive report of extraction run:
SUMMARY:
- Sources processed / failed
- Procedures extracted (by type: explicit/implicit/meta/tacit)
- Procedures flagged for review
- Time elapsed
- Tokens used
- Cost incurred
BY SOURCE:
- Each source with procedures extracted
- Confidence distribution
- Notable findings
QUALITY METRICS:
- Validation pass rate
- Confidence distribution
- Deduplication rate
RECOMMENDATIONS:
- Sources worth re-processing
- Prompt improvements needed
- Budget projections for remaining queue
Save report to output_directory/EXTRACTION_REPORT.md
When to Use
- When processing more than 10 sources at once
- When building initial procedure library at scale
- When systematically processing a YouTube channel backlog
- When you have approved budget for automated extraction
- When sources have been prioritized and queued
- When rapid library expansion is needed
- After completing pilot extraction to validate prompts
- When manual extraction hits diminishing returns
Verification
- All queued sources processed or documented as failed
- Each pass ran to convergence for each source
- Validation checks ran on all extractions
- Duplicates identified and resolved
- Procedure files are valid YAML
- Indexes updated with all new procedures
- Cost stayed within budget
- Extraction report is accurate and complete