HUMAN: check in with what you know about the thread (or use another agent to re-read the thread) and confirm, will this satisfy userb
I now have a comprehensive picture. Let me compile the complete findings.
---
## Complete Thread Analysis
### Who is userb?
**Handle:** `userb.bsky.social`
**Display Name:** User B
**Role:** She is a **paralegal** (she explicitly says "I'm a paralegal" in the thread). She works in law, uses Westlaw deterministically, and has direct experience evaluating LLM tools for legal work. She appears to have significant visibility in the legal tech skeptic space -- she says "given my visibility in this space, if it worked as advertised, my recommendation would plausibly go a long way."
She is **the primary person who "goaded" User A into building the Table of Authorities extraction tool.** User A explicitly confirms this in thread 1:
> "User B is a true poster in it for the love of the game. I had a great time arguing and I commend her for finally figuring out the way to nerd snipe me is to get 10 lawyers together to tell me about their UX problem and bet me I can't use Claude to solve it"
And:
> "I should clarify that I didn't feel User B implied this, only the weird clowns I muted who kept trying to agree with her"
---
### What User B Said and Wanted -- The Full Arc
The conversation spans **Feb 27 to Mar 2, 2026** and includes more than **200 posts** from User B alone. Here is the substantive arc:
#### Phase 1: Initial challenge (Feb 27)
She enters the thread when User A says lawyers who got sanctioned for using LLMs should have "used true citations":
- **"What do you mean by 'true ones'?"**
- **"The 'correctly' is the tricky bit there"**
- **"The problem is that in legal work, the only validator is the judge and you only get one run at it"**
- **"All the arguments that everyone is making are ultimately this argument. Using an LLM for legal work is like using an LLM to write code that will be pushed directly to prod without QC and can never be rolled back."**
- **"Oh, but there is -- it doesn't reason."**
She also named specific legal professionals involved: "one of whom is literally counsel to the NTEU" and linked to @anon-28.bsky.social (Will Li).
#### Phase 2: The citation test challenge (Feb 28)
This is where she sets up the test that eventually goads User A into building the tool:
- **"Would you be willing to run a test with them for me?"**
- She provides a Supreme Court brief: `www.supremecourt.gov/DocketPDF/20/20-493/203906/...`
- **"Please prompt your LLM to identify every citation to legal authority in this brief, by page number, line number, and the text of the citation. It should ONLY identify citations to legal authority."**
- **"This is a task that can be easily accomplished by an untrained 12-year-old, btw."**
When the results come back, she evaluates them:
- **"Okay, yes, these have both failed in the same way, which is exactly the way I expected. Because they don't reason, they were unable to distinguish between *citations* to legal authority and mere *references* to legal authority -- they identified all references …"**

… (message truncated, 12307 chars total)

---

Entered plan mode. You should now focus on exploring the codebase and designing an implementation approach.
In plan mode, you should:
1. Thoroughly explore the codebase to understand existing patterns
2. Identify similar features and architectural approaches
3. Consider multiple approaches and their trade-offs
4. Use AskUserQuestion if you need to clarify the approach
5. Design a concrete implementation strategy
6. When ready, use ExitPlanMode to present your plan for approval
Remember: DO NOT write or edit any files yet. This is a read-only exploration and planning phase.

---

Now let me generate a comprehensive summary. Here's what I've found:
## Summary of Findings
### 1. PDF Text Extraction Tools Available
**On macOS system:**
- `pdftotext`: NOT available (it ships with `poppler`, which is not installed via brew)
- Python PDF libraries (pymupdf/fitz, pdfplumber, pypdf, PyPDF2, pdf2image, reportlab, pdfrw, pikepdf): **NONE installed**
**What IS available:**
- Python 3.14.2 installed
- Basic pip packages only (certifi, cffi, cryptography, pycparser, wheel)
- **No PDF extraction infrastructure currently on the system**
**This is critical**: To extract text from PDFs, you'll need to install at least one library. Recommended options:
- `pdfplumber` - most beginner-friendly for text extraction and TOA work
- `PyPDF2` - lightweight, no extra dependencies
- `pymupdf/fitz` - fastest, but more complex
- `pypdf` - modern successor to PyPDF2
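The availability audit above can be reproduced with a stdlib-only probe (no PDF library required; module names follow the list above -- note `fitz` is the import name for pymupdf):

```python
import importlib.util

def available_pdf_libs(candidates=("fitz", "pdfplumber", "pypdf", "PyPDF2", "pikepdf")):
    """Return the candidate PDF modules that are importable in the
    current environment, without actually importing any of them."""
    return [name for name in candidates if importlib.util.find_spec(name) is not None]
```

On the machine described above this would return an empty list, matching the "NONE installed" finding.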
### 2. The Four PDF Briefs
| File | Size | Type | Pages Detected |
|------|------|------|----------------|
| brief-ysleta-petitioners.pdf | 592 KB | PDF v1.6 | 0 pages reported |
| brief-ysleta-opposition.pdf | 408 KB | PDF v1.6 | 21 pages |
| brief-tx-5th-dist.pdf | 3.5 MB | PDF v1.6 | Unknown |
| brief-cadc.pdf | 41 MB | PDF v1.7 | Unknown (very large) |
**Note**: The petitioners brief reports 0 pages in the file metadata - this may be a metadata issue or require inspection with actual PDF reading.
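For a quick sanity check before installing anything, the page tree's `/Count` entry can be grepped straight out of the raw bytes. This is a crude, hypothetical probe -- it misses counts stored in compressed object streams, which may be exactly why the petitioners brief reports 0 pages:

```python
import re

def reported_page_count(pdf_bytes):
    """Best-effort read of '/Count N' from a PDF's page-tree dictionary.
    Returns 0 when no count is visible in the raw bytes -- the same
    symptom the petitioners brief shows in its metadata."""
    counts = re.findall(rb"/Count\s+(\d+)", pdf_bytes)
    return max((int(c) for c in counts), default=0)
```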
**Status**: No page directories have been created yet (no `pages/` subdirectories exist).
### 3. Existing Infrastructure & Documentation
**Already in place:**
- ✓ `extract-toa-skill.md` (202 lines) - detailed extraction skill with:
- Full citation rules (cases, statutes, legislative materials, other authorities)
- Exclusion rules (record references, jurisdictional statutes, self-references)
- Short-form resolution patterns
- Page tracking rules
- Output formatting standards
- Execution model: parallel subagent chunks + verification phase + merge phase
- ✓ `test-toa-skill.md` (49 lines) - testing framework for comparing extraction against ground truth
- ✓ `error-analysis.md` (103 lines) - detailed error analysis from a test run against Ysleta opposition brief showing:
- 83 extracted authorities vs. 99 in ground truth
- 53 correct matches, 21 false negatives, 7 page discrepancies, 7 ambiguous
- Key error patterns identified (parenthetical-only exclusion too aggressive, passim undercounting, statute granularity issues, roman-numeral page skipping)
- ✓ `toa-draft.md` (100 lines) - sample TOA output showing structure and format
- ✓ `thread-links.md` - source references (GitHub gists, Supreme Court briefs, D.C. Circuit brief, TX state court brief)
- ✓ `claude` CLI v2.1.62 available at `/opt/homebrew/bin/claude` - can run subagents
- ✓ `.claude/settings.local.json` - project-level settings (Bluesky API permissions only)
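The error-analysis counts above fold into headline metrics along these lines (a sketch; the assumption that page discrepancies and ambiguous matches count toward both totals is mine, not necessarily how `error-analysis.md` computes them):

```python
def score_summary(correct, false_neg, page_disc, ambiguous, false_pos=0):
    """Fold per-category counts from an error-analysis run into
    recall/precision. Assumes page-discrepancy and ambiguous matches
    appear in both the ground-truth and extracted totals."""
    ground_truth = correct + false_neg + page_disc + ambiguous
    extracted = correct + false_pos + page_disc + ambiguous
    return {
        "recall": round(correct / ground_truth, 3) if ground_truth else 0.0,
        "precision": round(correct / extracted, 3) if extracted else 0.0,
    }
```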
### 4. Ground Truth Available
The error analysis document references a ground truth TOA for the Ysleta opposition brief with:
- 52 case citations
- 32 statute/rule cita…

… (message truncated, 4669 chars total)

I've read all four files. Here's my detailed structural analysis:
---
## EXECUTION MODEL
### extract-toa-skill.md
**Three-phase pipeline:**
1. **Phase 1: Per-chunk parallel extraction** — Split brief into ~10-page chunks, launch one Sonnet subagent per chunk. Each extracts citations with metadata (page, category, form, parenthetical, short-form mapping, ID status, exclusion flags).
2. **Phase 1.5: Verification (parallel, per-chunk)** — Haiku subagents verify extracted citations actually appear on claimed pages. Catches footnote boundary errors, signal-word spillover, multi-line citations.
3. **Phase 2: Merge (single subagent)** — Sonnet resolves short forms across chunk boundaries, de-duplicates, applies exclusion rules, computes page lists with passim logic, outputs final TOA.
**Front-matter handling:** The skill explicitly says to extract roman-numeral pages — but the error analysis shows this wasn't done.
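The Phase 1 chunking is simple enough to pin down (a sketch, assuming an ordered list of page labels like the `page-001.txt` names used elsewhere in the repo):

```python
def chunk_pages(page_labels, chunk_size=10):
    """Split the ordered page list into ~chunk_size groups,
    one per parallel extraction subagent (Phase 1)."""
    return [page_labels[i:i + chunk_size]
            for i in range(0, len(page_labels), chunk_size)]
```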
---
## BAKED-IN ASSUMPTIONS & OVERFITS
### 1. **Parenthetical-only exclusion (BROKEN)**
- **Assumption:** Cases/statutes appearing only inside `(quoting ...)`, `(citing ...)` don't belong in TOA.
- **Reality:** Ground truth includes ALL parenthetical citations. This caused 13 of 21 false negatives — the most severe systematic error.
- **Scope:** This rule appears universally in legal briefs, but the skill got it backwards. Not SCOTUS-specific, but fundamentally wrong.
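For concreteness, this is the kind of check the exclusion rule implies -- a hypothetical sketch, not the skill's actual logic. Per the ground truth, a citation that tests true here still belongs in the TOA, so this predicate should never drive exclusion:

```python
import re

SIGNAL = re.compile(r"\((?:quoting|citing)\s", re.IGNORECASE)

def only_in_signal_parenthetical(text, cite):
    """True if every occurrence of `cite` sits inside a '(quoting ...)'
    or '(citing ...)' parenthetical in `text`."""
    hits = list(re.finditer(re.escape(cite), text))
    if not hits:
        return False
    for m in hits:
        depth = 0
        enclosing_is_signal = False
        for i in range(m.start() - 1, -1, -1):  # scan left for the enclosing '('
            if text[i] == ")":
                depth += 1
            elif text[i] == "(":
                if depth == 0:
                    enclosing_is_signal = bool(SIGNAL.match(text, i))
                    break
                depth -= 1
        if not enclosing_is_signal:
            return False
    return True
```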
### 2. **Id. page-tracking rule (PARTIALLY BROKEN)**
- **Assumption:** `Id.` and `id. at [page]` never add pages — purely referential.
- **Exception carved out:** When `id.` abbreviates only the title/volume of a code (e.g., `id. § 2703` where id = 25 U.S.C.), each new section counts as a first citation of a distinct authority.
- **Reality:** Ground truth shows the exception applies, but merge agent didn't apply it aggressively enough. Restoration Act sections severely undercounted.
- **Scope:** Universal, not SCOTUS-specific.
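A minimal illustration of the carve-out (a hypothetical helper with simplified citation formats, not the merge agent's actual code):

```python
import re

def resolve_id_section(prev_full_cite, id_cite):
    """Expand 'id. § N' against the most recent full code citation,
    e.g. '25 U.S.C. § 2701' + 'id. § 2703' -> '25 U.S.C. § 2703'.
    Returns None when the antecedent is not a code citation, in which
    case the plain id.-adds-no-pages rule applies."""
    prev = re.match(r"(.+?)\s*§\s*[\w().-]+$", prev_full_cite)
    short = re.match(r"id\.\s*§\s*([\w().-]+)$", id_cite, re.IGNORECASE)
    if not (prev and short):
        return None
    return f"{prev.group(1)} § {short.group(1)}"
```

Each distinct section produced this way counts as a first citation of a separate authority, which is the behavior the Restoration Act undercounting violated.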
### 3. **Generic defined-abbreviation short forms (UNDERPERFORMING)**
- **Assumption:** Generic abbreviations like "S. Rep." don't add pages; distinctive named acts like "Restoration Act" or "IGRA" do add pages and trigger passim consideration.
- **Rule:** "If the abbreviation could only refer to one specific authority, it adds pages."
- **Reality:** Ground truth marks Restoration Act and §§ 107(a)/(b) as passim; draft only captured pages where full citation appeared, then undercounted named-act references.
- **Scope:** Not SCOTUS-specific, but heavily weighted toward cases where a statute is central to the brief.
### 4. **Roman-numeral page extraction skipped**
- **Skill says:** "Extract citations from roman-numeral pages too."
- **What happened:** Extraction started at page 1, skipping page "I". Ground truth includes IGRA citation from page I.
- **Scope:** Applies to any brief with front matter (all of them). Not SCOTUS-specific.
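The fix is mechanical: front-matter pages get roman-numeral labels so page "I" is extracted like any other. A sketch, assuming `front_matter_count` comes from the PDF's page-label metadata or a heuristic:

```python
def to_roman(n):
    """Lowercase roman numerals for 1..39 -- ample for brief front matter."""
    out = []
    for value, glyph in [(10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i")]:
        while n >= value:
            out.append(glyph)
            n -= value
    return "".join(out)

def page_label(n, front_matter_count):
    """Label the n-th physical page (1-based): roman numerals for the
    front matter, then arabic numbering restarting at 1 for the body."""
    return to_roman(n) if n <= front_matter_count else str(n - front_matter_count)
```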
### 5. **Filing/record exclusion too broad**
- **Assumption:** App., Pet. App., BIO, Pet. Br., Resp. Br., JA, Conditional Cross-Pet…

… (message truncated, 13301 chars total)

Good. Modern `uv` supports `uv run --with pdfplumber script.py` for zero-config dependency injection. Here is the complete architecture.
---
## Architecture: TOA Extraction Test Harness
### Summary of Constraints Discovered
1. **No PDF libraries installed.** `uv` (0.9.28) is available -- use `uv run --with pdfplumber` for zero-install script execution.
2. **`claude` CLI (2.1.62) cannot be nested** inside a Claude Code session. The test runner must be invoked from a terminal shell, not from within Claude Code.
3. **No `poppler` installed** -- cannot use `pdftotext` or Claude Code's built-in PDF reader. All PDF processing goes through `pdfplumber` via `uv run`.
4. **The 43MB CADC brief** will need page-count estimation first to determine chunking.
5. **The skill uses subagents** via Claude Code's Task tool, which means running the skill via `claude -p` with `--dangerously-skip-permissions` (for automated runs) will spawn those subagents automatically.
### File Layout
```
lawthing/
├── briefs/ # symlinks or the PDFs themselves
│ ├── ysleta-petitioners.pdf
│ ├── ysleta-opposition.pdf
│ ├── cadc.pdf
│ └── tx-5th-dist.pdf
│
├── scripts/
│ ├── pdf_to_pages.py # PDF -> per-page text files
│ ├── extract_ground_truth.py # Extract TOA from brief front matter
│ ├── score.py # Compare extraction output vs ground truth
│ └── run_all.sh # Orchestrator: runs pipeline across all briefs
│
├── data/
│ ├── ysleta-petitioners/
│ │ ├── pages/ # page-001.txt, page-002.txt, ...
│ │ ├── ground-truth.md # Extracted TOA from front matter
│ │ ├── extraction-output.md # Skill output
│ │ └── score.json # Scoring results
│ ├── ysleta-opposition/
│ │ ├── pages/
│ │ ├── ground-truth.md
│ │ ├── extraction-output.md
│ │ └── score.json
│ ├── cadc/
│ │ └── ...
│ └── tx-5th-dist/
│ └── ...
│
├── results/
│ ├── cross-brief-report.md # Aggregated cross-brief error analysis
│ └── iteration-log.jsonl # Append-only log of each iteration's scores
│
├── skills/
│ └── extract-toa/
│ └── SKILL.md # The skill under test (copy of extract-toa-skill.md)
│
├── extract-toa-skill.md # Original skill (existing)
├── error-analysis.md # Existing
├── test-toa-skill.md # Existing
└── toa-draft.md # Existing
```
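Scaffolding this layout is one `pathlib` pass (directories only; the existing top-level files stay where they are):

```python
from pathlib import Path

BRIEFS = ("ysleta-petitioners", "ysleta-opposition", "cadc", "tx-5th-dist")

def scaffold(root):
    """Create the directory skeleton from the layout above under `root`."""
    root = Path(root)
    for brief in BRIEFS:
        (root / "data" / brief / "pages").mkdir(parents=True, exist_ok=True)
    for d in ("briefs", "scripts", "results", "skills/extract-toa"):
        (root / d).mkdir(parents=True, exist_ok=True)
    return root
```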
### Component 1: `scripts/pdf_to_pages.py` -- PDF to Per-Page Text
**Purpose:** Extract text from a PDF, split into one file per page, with line numbers for debugging.
**Key design decisions:**
- Use `pdfplumber` (best text extraction for legal documents -- preserves columns, footnotes, headers).
- Output files named `page-001.txt`, `page-002.txt`, etc. First line of each file is the page number (matching the skill's expectations from section 1 of `extract-toa-skill.md`).
- Roman-numeral front ma…

… (message truncated, 20660 chars total)

File created successfully at: [REDACTED]/eventual-frolicking-engelbart.md

{
"allowedPrompts": [
{
"tool": "Bash",
"prompt": "install Python packages via uv or pip"
},
{
"tool": "Bash",
"prompt": "run Python scripts for PDF processing"
},
{
"tool": "Bash",
"prompt": "run claude CLI for extraction and ground truth parsing"
},
{
"tool": "Bash",
"prompt": "create directories for project structure"
},
{
"tool": "Bash",
"prompt": "run test/scoring scripts"
}
],
"plan": "# Plan: Cross-Brief TOA Extraction Test Harness + Skill Iteration\n\n## Context\n\nUser A built a Claude Code skill for extracting Table of Authorities from legal briefs. User B (a paralegal) correctly identified the core problem: the skill was developed against one brief and tested against one other — it's overtrained. She provided 4 briefs spanning SCOTUS, D.C. Circuit, and Texas state court, and wanted all tested. We need a pipeline that runs the skill against all briefs, scores it against ground truth, and lets us iterate until it generalizes.\n\nThe skill has already been partially fixed (parenthetical inclusion, filing exclusion narrowing, statute granularity, roman-numeral pages). Now we need to validate those fixes and find remaining issues across all brief types.\n\n## File Layout\n\n```\nlawthing/\n├── scripts/\n│ ├── pdf_to_pages.py # PDF → per-page text files\n│ ├── extract_ground_truth.py # Extract TOA section from PDF front matter\n│ ├── score.py # Compare extraction vs ground truth\n│ └── run_brief.sh # Run extraction on one brief via claude CLI\n├── data/\n│ ├── ysleta-petitioners/\n│ │ ├── pages/ # page-i.txt, page-001.txt, ...\n│ │ ├── ground-truth.md # TOA from the brief itself\n│ │ └── extraction-output.md # Skill output\n│ ├── ysleta-opposition/ # (same structure)\n│ ├── cadc/\n│ └── tx-5th-dist/\n├── skills/\n│ └── extract-toa/\n│ └── SKILL.md # Working copy of the skill (iterated on)\n├── extract-toa-skill.md # (existing, reference copy)\n├── brief-*.pdf # (existing, 4 PDFs)\n└── ... # (existing files)\n```\n\n## Steps\n\n### 1. Install nothing — use `uv run --with`\n\n`uv` (0.9.28) is already installed. All Python scripts run via `uv run --with pdfplumber` / `uv run --with rapidfuzz`. No venv, no requirements.txt.\n\n### 2. Build `scripts/pdf_to_pages.py`\n\nExtracts text from a PDF, one file per page. Key behaviors:\n- Detect page numbers (standalone numbers on their own line) — use as display numbers\n- Roman-numeral front matter pages: `page-i.txt`, `page-ii.txt`, etc.\n- Each file starts with the page number on line 1 (matching skill expectations)\n- Preserve line numbers within each page (tab-separated second column) for debugging\n- `--page-range` flag for partial extraction during dev\n- Handle the 43MB CADC brief (pdfplumber loads lazily, so memory is fine)\n\n### 3. Build `scripts/extract_ground_truth.py`\n\nTwo-stage ground truth extraction:\n- **Stage 1 (Python/pdfplumber):** Find pages between \"TABLE OF AUTHORITIES\" heading and next major section. Output raw text to `toa-raw.txt`.\n- **Stage 2 (Claude CLI):** Parse raw TOA into structured markdown tables (same format as `toa-draft.md`). Claude handles the formatting variation across courts (dotted leaders, continuation pages, column layouts).\n\nOutput format — markdown tables by category with `Authority | Pages` columns.\n\n### 4. Build `scripts/score.py`\n\nCompare extraction output vs ground truth. Key design:\n- Parse both into `(authority_text, page_set)` tuples per category\n- Match using key-component extraction: `(volume, reporter, start_page)` for cases, `(title, code, section)` for statutes — more reliable than fuzzy string matching\n- Fall back to rapidfuzz (threshold 85%) for authorities that don't match structurally\n- Score: correct / false_negative / false_positive / page_discrepancy / ambiguous (passim vs explicit)\n- Output `score.json` per brief + optional `--aggregate` mode for cross-brief report\n- Classify false negative causes heuristically (parenthetical-only, roman-numeral page, standalone section, filing exclusion, etc.)\n\n### 5. Build `scripts/run_brief.sh`\n\nRuns extraction on one brief via `claude -p`:\n```bash\nunset CLAUDECODE # avoid nested-session error\nclaude -p --dangerously-skip-permissions --model sonnet \\\n --max-budget-usd 5.00 \\\n \"Read the skill at skills/extract-toa/SKILL.md. Extract TOA from pages in data/<brief>/pages/. Output markdown tables.\"\n```\n\nUses sonnet for iteration speed, opus for final validation. `--max-budget-usd` prevents runaway subagent chains.\n\n### 6. Copy skill to working location\n\nCopy `extract-toa-skill.md` → `skills/extract-toa/SKILL.md`. All iteration edits target the working copy.\n\n### 7. Run pipeline: pages → ground truth → extraction → scoring\n\nOrder of briefs (smallest/best-understood first):\n1. `ysleta-opposition` — has existing error analysis, known ground truth\n2. `ysleta-petitioners` — same case, SCOTUS format\n3. `tx-5th-dist` — state court, different conventions\n4. `cadc` — largest (43MB), run last\n\n### 8. Iterate on skill\n\nLoop:\n1. Run one brief, check `score.json`\n2. Identify error pattern\n3. Edit `skills/extract-toa/SKILL.md`\n4. Re-run, confirm improvement\n5. Run all briefs, check cross-brief report\n6. Repeat until accuracy stabilizes across all 4\n\n### 9. Generalization improvements to the skill\n\nBeyond the bugs already fixed, the skill needs:\n- **State court record abbreviations:** Add `CR`, `ER`, `TR` to the filing exclusion examples\n- **State reporters:** Add `S.W.`, `S.W.2d`, `S.W.3d`, `N.E.`, `A.2d`, etc. to the reporter list\n- **Flexible categories:** Don't assume all 4 standard categories exist — state briefs may lack Legislative Materials\n- **Line numbers in extraction output:** Add to intermediate extraction results (for debugging), not to final TOA\n\n## Verification\n\n- Each brief's `score.json` shows accuracy metrics\n- Cross-brief report identifies patterns that span briefs\n- Target: match or exceed the original skill's accuracy on the opposition brief (~83 authorities found) while achieving comparable accuracy on the other 3 briefs\n- Final validation: re-run all 4 briefs with opus model, compare to sonnet iteration runs\n"
}