Transcript: Building Self-Scoring Infrastructure
2026-03-03 / 24m 54s / claude-opus-4-6 / 22 events
0
#1
check in with what you know about the thread (or use another agent to re-read the thread) and confirm: will this satisfy userb?
HUMAN
1
Task
general-purpose Re-read Bluesky thread for userb
171 lines returned
I now have a comprehensive picture. Let me compile the complete findings.

---

## Complete Thread Analysis

### Who is userb?

**Handle:** `userb.bsky.social`  
**Display Name:** User B  
**Role:** She is a **paralegal** (she explicitly says "I'm a paralegal" in the thread). She works in law, uses Westlaw deterministically, and has direct experience evaluating LLM tools for legal work. She appears to have significant visibility in the legal tech skeptic space -- she says "given my visibility in this space, if it worked as advertised, my recommendation would plausibly go a long way."

She is **the primary person who "goaded" user-a into building the Table of Authorities extraction tool.** User A explicitly confirms this in thread 1:

> "User B is a true poster in it for the love of the game. I had a great time arguing and I commend her for finally figuring out the way to nerd snipe me is to get 10 lawyers together to tell me about their UX problem and bet me I can't use Claude to solve it"

And:

> "I should clarify that I didn't feel User B implied this, only the weird clowns I muted who kept trying to agree with her"

---

### What User B Said and Wanted -- The Full Arc

The conversation spans **Feb 27 to Mar 2, 2026** and includes more than **200 posts** from User B alone. Here is the substantive arc:

#### Phase 1: Initial challenge (Feb 27)

She enters the thread when User A says lawyers who got sanctioned for using LLMs should have "used true citations":

- **"What do you mean by 'true ones'?"**
- **"The 'correctly' is the tricky bit there"**
- **"The problem is that in legal work, the only validator is the judge and you only get one run at it"**
- **"All the arguments that everyone is making are ultimately this argument. Using an LLM for legal work is like using an LLM to write code that will be pushed directly to prod without QC and can never be rolled back."**
- **"Oh, but there is -- it doesn't reason."**

She also named specific legal professionals involved: "one of whom is literally counsel to the NTEU" and linked to @anon-28.bsky.social (Will Li).

#### Phase 2: The citation test challenge (Feb 28)

This is where she sets up the test that eventually goads User A into building the tool:

- **"Would you be willing to run a test with them for me?"**
- She provides a Supreme Court brief: `www.supremecourt.gov/DocketPDF/20/20-493/203906/...`
- **"Please prompt your LLM to identify every citation to legal authority in this brief, by page number, line number, and the text of the citation. It should ONLY identify citations to legal authority."**
- **"This is a task that can be easily accomplished by an untrained 12-year-old, btw."**

When the results come back, she evaluates them:

- **"Okay, yes, these have both failed in the same way, which is exactly the way I expected. Because they don't reason, they were unable to distinguish between *citations* to legal authority and mere *references* to legal authority -- they identified all references
… (12307 chars total)
2
eventual-frolicking-engelbart · 43 steps · 43 messages
I need to re-read a Bluesky thread to find any posts or replies by someone named "userb" (or similar spelling). The thread is rooted at a post by user-a.bsky.social about a Claude skill for extracting Table of Authorities from legal briefs. Use the Bluesky API to get the full thread with all replies: https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mg3ichachs2n&depth=100&parentHeight=100 Also check the root post's thread: https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mfzlau353c2h&depth=100&parentHeight=100 I need to find: 1. Any posts/replies by userb (or similar handle) in the thread 2. What they said — their feedback, requests, complaints, what they wanted 3. Who they are in context (are they one of the people who "goaded" user-a into building this?) 4. Any other participants in the thread and what they said Return the FULL text of every reply in the thread, with author handles, so I can see the complete conversation. Pay special attention to anyone who seems to be providing the briefs or requesting features.
[WebFetch] https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mg3ichachs2n&depth=100&parentHeight=100
[WebFetch] https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mfzlau353c2h&depth=100&parentHeight=100
[result] # Thread Analysis: Table of Authorities Extraction ## Key Posts by User B **Handle:** userb.bsky.social **Display Name:** User B ### Post 1 (Feb 28, 07:17 UTC) "Okay, yes, these have both failed
[result] # Thread Analysis: LLM Reasoning and Legal Work ## Complete Thread Posts **1. Root Post** - **Handle:** user-a.bsky.social - **Display Name:** User A - **Text:** "now we're talking. a claude skill b
[Bash] curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mg
[Bash] curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getPostThread?uri=at://user-a.bsky.social/app.bsky.feed.post/3mf
[result] <persisted-output> Output too large (227.4KB). Full output saved to: [REDACTED]/b06q9mg4g.txt Preview (fir
[result] <persisted-output> Output too large (145.4KB). Full output saved to: [REDACTED]/b3ddvr4g7.txt Preview (fir
[Bash] cat [REDACTED]/b06q9mg4g.tx
[result] ================================================================================ THREAD 1 (3mg3ichachs2n) - Table of Authorities thread ================================================================
[Bash] cat [REDACTED]/b3ddvr4g7.tx
[result] ================================================================================ THREAD 2 (3mfzlau353c2h) - Parent/root thread =========================================================================
[Bash] cat [REDACTED]/b3ddvr4g7.tx
[result] ALL UNIQUE HANDLES IN THREAD 2: @anon-30.bsky.social @user-a.bsky.social @anon-11.bsky.social @user-b.bsky.social USER B POSTS FOUND: 7 [2026-03-01T20:28:26.880Z] What have you learned from
[Bash] cat [REDACTED]/b06q9mg4g.tx
[result] ALL UNIQUE HANDLES IN THREAD 1: @bunkum.us @danny.page @user-a.bsky.social @anon-08.bsky.social @anon-35.bsky.social @searyanc.dev USER B POSTS FOUND: 0
[Bash] # The root post referenced in thread 2's parent chain is 3mfk33buzyk27, let me fetch that curl -s 'https://public.api.bs
[result] ROOT THREAD (3mfk33buzyk27) - All posts sorted chronologically ================================================================================ [2026-02-23T16:28:23.682Z] @user-a.bsky.social (User A)
[Bash] # Freddy linked to bsky.app/profile/kath... - let's look for User B's posts that branch off # First let me find all user
[result] ALL USER B POSTS in root thread: [2026-02-27T07:01:11.965Z] @user-b.bsky.social URI: at://did:plc:g6asx7scljsgdu3vqwtyfszj/app.bsky.feed.post/3mft5aqkhec2z one of whom is literally counsel to the NTEU
[Bash] # Let me look at User B's profile to find her posts in this conversation # First, let me get her author feed curl -s 'ht
[result] <persisted-output> Output too large (33.4KB). Full output saved to: [REDACTED]/bvkdqpydj.txt Preview (firs
[Read] [REDACTED]/bvkdqpydj.txt
[result] 1→[2026-03-02T23:57:48.668Z] 2→URI: at://did:plc:g6asx7scljsgdu3vqwtyfszj/app.bsky.feed.post/3mg4hhdwd5s2u 3→Root: at://did:plc:ctphcgyhnllfluywokibtrgb/app.bsky.feed.post/3mfk33buzyk27
[Read] [REDACTED]/bvkdqpydj.txt
[result] <system-reminder>Warning: the file exists but is shorter than the provided offset (483). The file has 483 lines.</system-reminder>
[Bash] # Get more of her posts, including earlier ones curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?ac
[result] Cursor: 2026-03-02T04:24:36.865Z Total posts returned: 100 Last post: 2026-03-02T04:24:36.865Z
[Bash] # Need to paginate to get earlier posts (Feb 27-28) curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFee
[result] <persisted-output> Output too large (38KB). Full output saved to: [REDACTED]/blamlkxyi.txt Preview (first
[Read] [REDACTED]/blamlkxyi.txt
[result] 1→[2026-03-02T04:22:18.176Z] [IN THREAD] 2→URI: at://did:plc:g6asx7scljsgdu3vqwtyfszj/app.bsky.feed.post/3mg2freskk22f 3→Root: at://did:plc:ctphcgyhnllfluywokibtrgb/app.bsky.feed.post/3
[Bash] # Get page 3 - earlier posts curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=userb.bsky.soci
[result] [2026-03-02T01:19:44.898Z] [IN THREAD] Text: I’m not making a claim. I’m asking a question. Please answer the question I asked. ============================================================ [2026-03-02
[Bash] # Get page 4 - need to get to Feb 28 and earlier curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?a
[result] [2026-03-01T23:37:02.548Z] [IN THREAD] Text: Girl, same. ============================================================ [2026-03-01T21:56:14.946Z] [IN THREAD] Text: This is the essence of my argument: w
[Bash] # Get page 5 - need earlier posts from Feb 28 curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?acto
[result] [2026-03-01T18:11:49.441Z] [IN THREAD] Text: What do you use it for besides that? ============================================================ [2026-03-01T18:10:05.300Z] [IN THREAD] Text: I asked for
[Bash] # Get page 6 - remaining Feb 28 posts curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?actor=user
[result] [2026-02-28T21:12:22.562Z] [IN THREAD] Text: I’d say that a lot of the time, they actually can’t X. This is one of the constant frustrations I experience whenever someone tries to foist one on me — th
[Bash] # Get page 7 - the earliest posts in the thread curl -s 'https://public.api.bsky.app/xrpc/app.bsky.feed.getAuthorFeed?ac
[result] [2026-02-28T04:57:16.962Z] [IN THREAD] Text: I don't understand. ============================================================ [2026-02-28T04:53:33.077Z] [IN THREAD] Text: Did it do it yet? ===========
3
thinking · 16 lines · 1012 chars
This is extremely detailed. Let me synthesize what User B actually wants and evaluate whether the fixed skill would satisfy her. Key things User B wants:

1. Extract all citations (not references) with page and line numbers
2. Generate a TOA
3. Work across diverse briefs, not just one
4. Zero/near-zero false positives
5. Fast
6. She was initially impressed with the TOA output ("pretty damn good")
7. But later pushed back: "you didn't make it work" / "you proved you could tune the prompt to be partially effective on a single brief"
8. She wanted it run on ALL the exemplar briefs she provided
9. She cares about the citation vs. reference distinction deeply

Her real frustration was that the tool was developed and tested on just one or two briefs when she'd provided multiple exemplars—it wasn't generalizable, just overfit to that first brief. Plus, the current version only captures page numbers, not line numbers, and I'm still not properly distinguishing citations from references the way she needs.
4
No, it won't. Here's why.

User B is a paralegal who essentially nerd-sniped User A into building this. She initially praised the TOA output ("pretty damn good"), but her **end-of-thread position** was skepticism:

> "You proved that you could tune the prompt to be partially effective on a single brief. Was that your point?"
>
> "Did you even run it on all the exemplars I found for you?"
>
> "You asked for ten briefs and only ran one or two as far as I can tell"

Her complaints are not about the specific bugs I fixed. They're structural:

1. **Generalization.** The skill was developed against one brief and tested on one more. She provided 4+ briefs and wanted all of them tested. She (and @bunkum.us) correctly identified that the prompt is overtrained on known cases — like a regex that handles the examples you've seen but fails on new ones. My fixes are derived from the same single error analysis, so they're more of the same pattern.
2. **Line numbers.** She repeatedly asked for line numbers, not just page numbers: "I'd like to see it with the line numbers to make sure it's discriminating appropriately." The skill only tracks pages.
3. **Citations vs. references.** Her core technical objection is that LLMs can't distinguish between a *citation to authority* (putting forward evidence for a legal claim) and a *mere reference* (mentioning a case as a noun/historical event). She calls this a teleological distinction — you have to understand *how* the authority is being used, not just recognize the citation format. The skill doesn't address this at all; it treats anything that looks like a citation syntactically as a citation.
4. **She wanted a product, not a prompt.** She said she'd "pay for this tomorrow out of my own pocket" and asked User A about billing rates — but she also said "I'm not splitting my workflow across two tools" and wanted it inside Westlaw/Lexis.

The fixes I made are correct for the bugs they address — they'd eliminate ~15 of 21 false negatives. But User B would say: "great, you refined the prompt against *one more* brief. Run it on the CADC brief, the Texas appeals brief, and the NY court decision, then we'll talk."
CLAUDE
5
#2
i'm not gonna do the westlaw integration (today, at least), but i want to address her other points. given the full thread, can we make a skill that will both hit the success level of the first skill on its given document and also do similarly well on the others? we can use subagents to loop over testing it
HUMAN
6
EnterPlanMode
{}
11 lines returned
Entered plan mode. You should now focus on exploring the codebase and designing an implementation approach.

In plan mode, you should:
1. Thoroughly explore the codebase to understand existing patterns
2. Identify similar features and architectural approaches
3. Consider multiple approaches and their trade-offs
4. Use AskUserQuestion if you need to clarify the approach
5. Design a concrete implementation strategy
6. When ready, use ExitPlanMode to present your plan for approval

Remember: DO NOT write or edit any files yet. This is a read-only exploration and planning phase.
7
thinking · 22 lines · 2686 chars
Let me think about what's needed here. The user wants to:

1. Build a skill that generalizes across multiple briefs (not just the one it was developed against)
2. Use subagents to test it against all the briefs User B provided
3. Address User B's specific concerns: line numbers, citation vs reference distinction, generalization

Let me first understand what we have:

- The extract-toa-skill.md (the prompt, already partially fixed)
- 4 PDF briefs downloaded
- The error analysis from the first test
- The test-toa-skill.md (the testing harness)

Now I'm working through the core challenges: we only have ground truth for the opposition brief, so I'll need to figure out how to validate against the other three. The citation versus reference distinction is critical—that's what User B really wants the skill to capture. And the current implementation only tracks page numbers, not line numbers.

My plan is to enhance the skill to handle these distinctions, add line number tracking, then run it against all four briefs and use those results to refine it further. Actually, I just realized the SCOTUS briefs themselves include a table of authorities at the front, which I could extract and use as ground truth to test the skill's extraction from the body of each brief. Let me check what we're actually working with in these PDFs.

I need to set up a proper testing framework where I develop against the first brief and validate against the other three to avoid overfitting, which means I should first explore what PDF extraction tools are available and whether these briefs already contain their own tables of authorities I can use as ground truth.

Now I'm realizing User B's distinction between citations and references was about a different task entirely—extracting citations from brief body text, not building a table of authorities. For the TOA specifically, I want everything with a legal citation format, so that distinction doesn't really apply here. The core challenges for improving the TOA skill are generalization across different brief types, reducing false positives and negatives, tracking line numbers accurately, and fixing the specific bugs we've already identified.

I'm thinking through a two-phase approach: first extracting text from the PDFs and pulling out existing TOAs as ground truth, then improving the skill itself by adding line number tracking, refining the citation rules to be less SCOTUS-centric, and handling the citation versus reference distinction where it actually matters. Then I need to build a test harness that can run the extraction against all four briefs, compare results to ground truth when available, and give me metrics to iterate on.
8
Let me explore what we're working with and figure out the right approach.
CLAUDE
9
Task
Explore Explore PDF briefs and tooling
103 lines returned
Here's a comprehensive summary of what I've found:

## Summary of Findings

### 1. PDF Text Extraction Tools Available

**On macOS system:**
- `pdftotext`: NOT available
- `poppler`: NOT available (via brew)
- Python PDF libraries (pymupdf/fitz, pdfplumber, pypdf, PyPDF2, pdf2image, reportlab, pdfrw, pikepdf): **NONE installed**

**What IS available:**
- Python 3.14.2 installed
- Basic pip packages only (certifi, cffi, cryptography, pycparser, wheel)
- **No PDF extraction infrastructure currently on the system**

**This is critical**: To extract text from PDFs, you'll need to install at least one library. Recommended options:
- `pdfplumber` - most beginner-friendly for text extraction and TOA work
- `PyPDF2` - lightweight, no extra dependencies
- `pymupdf/fitz` - fastest, but more complex
- `pypdf` - modern successor to PyPDF2
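
As a quick sanity check once one of these is installed, a minimal pdfplumber sketch along these lines would confirm real page counts and text quality (the brief path is illustrative):

```python
# Minimal sketch, assuming pdfplumber is installed (pip install pdfplumber,
# or run the script via `uv run --with pdfplumber`). The path is illustrative.
import pdfplumber

with pdfplumber.open("brief-ysleta-opposition.pdf") as pdf:
    print(len(pdf.pages), "pages parsed")       # real count, not file metadata
    first = pdf.pages[0].extract_text() or ""   # extract_text() can return None
    print(first[:300])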

### 2. The Four PDF Briefs

| File | Size | Type | Pages Detected |
|------|------|------|----------------|
| brief-ysleta-petitioners.pdf | 592 KB | PDF v1.6 | 0 pages reported |
| brief-ysleta-opposition.pdf | 408 KB | PDF v1.6 | 21 pages |
| brief-tx-5th-dist.pdf | 3.5 MB | PDF v1.6 | Unknown |
| brief-cadc.pdf | 41 MB | PDF v1.7 | Unknown (very large) |

**Note**: The petitioners brief reports 0 pages in its file metadata - likely a metadata quirk; an actual PDF parse is needed to confirm the real page count.

**Status**: No page directories have been created yet (no `pages/` subdirectories exist).

### 3. Existing Infrastructure & Documentation

**Already in place:**
- ✓ `extract-toa-skill.md` (202 lines) - detailed extraction skill with:
  - Full citation rules (cases, statutes, legislative materials, other authorities)
  - Exclusion rules (record references, jurisdictional statutes, self-references)
  - Short-form resolution patterns
  - Page tracking rules
  - Output formatting standards
  - Execution model: parallel subagent chunks + verification phase + merge phase

- ✓ `test-toa-skill.md` (49 lines) - testing framework for comparing extraction against ground truth

- ✓ `error-analysis.md` (103 lines) - detailed error analysis from a test run against Ysleta opposition brief showing:
  - 83 extracted authorities vs. 99 in ground truth
  - 53 correct matches, 21 false negatives, 7 page discrepancies, 7 ambiguous
  - Key error patterns identified (parenthetical-only exclusion too aggressive, passim undercounting, statute granularity issues, roman-numeral page skipping)

- ✓ `toa-draft.md` (100 lines) - sample TOA output showing structure and format

- ✓ `thread-links.md` - source references (GitHub gists, Supreme Court briefs, D.C. Circuit brief, TX state court brief)

- ✓ `claude` CLI v2.1.62 available at `/opt/homebrew/bin/claude` - can run subagents

- ✓ `.claude/settings.local.json` - project-level settings (Bluesky API permissions only)

### 4. Ground Truth Available

The error analysis document references a ground truth TOA for the Ysleta opposition brief with:
- 52 case citations
- 32 statute/rule cita
… (4669 chars total)
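
Since each brief carries its own TOA in its front matter, a rough sketch of pulling that span out with pdfplumber might look like the following (the heading strings and the 15-page front-matter cap are assumptions, not verified against these briefs):

```python
# Rough sketch: grab the raw "TABLE OF AUTHORITIES" span from a brief's front
# matter for use as ground truth. Assumes pdfplumber; heading markers and the
# front-matter page cap are guesses -- real briefs vary.
import pdfplumber

def extract_toa_text(pdf_path: str) -> str:
    collected, in_toa = [], False
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages[:15]:              # TOA lives in the front matter
            text = page.extract_text() or ""
            upper = text.upper()
            if "TABLE OF AUTHORITIES" in upper:
                in_toa = True
            elif in_toa and ("OPINIONS BELOW" in upper or "STATEMENT OF" in upper):
                break                            # hit the next major section
            if in_toa:
                collected.append(text)
    return "\n".join(collected)
```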
10
eventual-frolicking-engelbart · 6 steps · 6 messages
I need to understand what we're working with for a legal brief TOA extraction project. Explore the following: 1. Check what PDF text extraction tools are available on this macOS system (pdftotext, poppler, pymupdf, pdfplumber, etc.). Run `which pdftotext`, `python3 -c "import fitz; print('pymupdf ok')"`, `python3 -c "import pdfplumber; print('pdfplumber ok')"`, check if `brew list poppler` works, etc. 2. Look at the 4 PDF briefs in [REDACTED]/: - brief-ysleta-petitioners.pdf - brief-ysleta-opposition.pdf - brief-cadc.pdf - brief-tx-5th-dist.pdf For each one, try to extract the first 5-10 pages of text to understand: - Does it have a Table of Authorities we can use as ground truth? - How are page numbers formatted? - What's the overall structure? - How long is it (page count)? 3. Check if there are any existing scripts, pages directories, or other processing artifacts already in [REDACTED]/ 4. Check what's available for running Claude subagents - is there a claude CLI available? Run `which claude` and `claude --version` if found. Report back all findings in detail - I need to know what infrastructure exists and what ground truth we can extract from these briefs.
[Bash] which pdftotext && which poppler-utils && brew list poppler 2>/dev/null || echo "Checking pdftotext and poppler..."
[Bash] python3 -c "import fitz; print('pymupdf: OK')" 2>&1
[Bash] python3 -c "import pdfplumber; print('pdfplumber: OK')" 2>&1
[Bash] which claude && claude --version 2>&1
[Glob] **/*
11
Task
Explore Read existing skill files
188 lines returned
I've read all four files. Here's my detailed structural analysis:

---

## EXECUTION MODEL

### extract-toa-skill.md

**Three-phase pipeline:**
1. **Phase 1: Per-chunk parallel extraction** — Split brief into ~10-page chunks, launch one Sonnet subagent per chunk. Each extracts citations with metadata (page, category, form, parenthetical, short-form mapping, ID status, exclusion flags).
2. **Phase 1.5: Verification (parallel, per-chunk)** — Haiku subagents verify extracted citations actually appear on claimed pages. Catches footnote boundary errors, signal-word spillover, multi-line citations.
3. **Phase 2: Merge (single subagent)** — Sonnet resolves short forms across chunk boundaries, de-duplicates, applies exclusion rules, computes page lists with passim logic, outputs final TOA.
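
For Phase 1, the chunking itself is mechanical; a hypothetical helper (the names and roman-numeral handling are assumptions, not from the skill file) could look like:

```python
from pathlib import Path

# Hypothetical chunking helper for Phase 1: group per-page text files into
# ~10-page chunks, one chunk per extraction subagent. Roman-numeral front
# matter sorts before arabic body pages.
ROMAN = {"i": 1, "ii": 2, "iii": 3, "iv": 4, "v": 5, "vi": 6, "vii": 7, "viii": 8}

def chunk_pages(pages_dir: str, chunk_size: int = 10) -> list[list[Path]]:
    def key(p: Path):
        stem = p.stem.removeprefix("page-")         # "page-003" -> "003"
        return (1, int(stem)) if stem.isdigit() else (0, ROMAN.get(stem, 99))
    pages = sorted(Path(pages_dir).glob("page-*.txt"), key=key)
    return [pages[i:i + chunk_size] for i in range(0, len(pages), chunk_size)]
```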

**Front-matter handling:** The skill explicitly says to extract roman-numeral pages — but the error analysis shows this wasn't done.

---

## BAKED-IN ASSUMPTIONS & OVERFITS

### 1. **Parenthetical-only exclusion (BROKEN)**
- **Assumption:** Cases/statutes appearing only inside `(quoting ...)`, `(citing ...)` don't belong in TOA.
- **Reality:** Ground truth includes ALL parenthetical citations. This caused 13 of 21 false negatives — the most severe systematic error.
- **Scope:** This rule appears universally in legal briefs, but the skill got it backwards. Not SCOTUS-specific, but fundamentally wrong.

### 2. **Id. page-tracking rule (PARTIALLY BROKEN)**
- **Assumption:** `Id.` and `id. at [page]` never add pages — purely referential.
- **Exception carved out:** When `id.` abbreviates only the title/volume of a code (e.g., `id. § 2703` where id = 25 U.S.C.), each new section counts as a first citation of a distinct authority.
- **Reality:** Ground truth shows the exception applies, but merge agent didn't apply it aggressively enough. Restoration Act sections severely undercounted.
- **Scope:** Universal, not SCOTUS-specific.

### 3. **Generic defined-abbreviation short forms (UNDERPERFORMING)**
- **Assumption:** Generic abbreviations like "S. Rep." don't add pages; distinctive named acts like "Restoration Act" or "IGRA" do add pages and trigger passim consideration.
- **Rule:** "If the abbreviation could only refer to one specific authority, it adds pages."
- **Reality:** Ground truth marks Restoration Act and §§ 107(a)/(b) as passim; draft only captured pages where full citation appeared, then undercounted named-act references.
- **Scope:** Not SCOTUS-specific, but heavily weighted toward cases where a statute is central to the brief.

### 4. **Roman-numeral page extraction skipped**
- **Skill says:** "Extract citations from roman-numeral pages too."
- **What happened:** Extraction started at page 1, skipping page "I". Ground truth includes IGRA citation from page I.
- **Scope:** Applies to any brief with front matter (all of them). Not SCOTUS-specific.

### 5. **Filing/record exclusion too broad**
- **Assumption:** App., Pet. App., BIO, Pet. Br., Resp. Br., JA, Conditional Cross-Pet
… (13301 chars total)
12
eventual-frolicking-engelbart · 9 steps · 9 messages
Read and analyze the following files in [REDACTED]/: 1. extract-toa-skill.md - the main extraction prompt/skill 2. test-toa-skill.md - the testing/comparison skill 3. error-analysis.md - the error analysis from testing 4. thread-links.md - the links index For each, provide a detailed structural summary. I need to understand: - What the skill's execution model is (phases, subagents, etc.) - What assumptions are baked in that might not generalize - What would need to change to make it work across SCOTUS briefs, circuit court briefs, and state court briefs - Whether the test skill can actually be used as-is or needs rework Pay special attention to anything that's SCOTUS-specific or overfitted to one brief type.
[Read] [REDACTED]/extract-toa-skill.md
[result] 1→# Skill: Extract Table of Authorities from a Legal Brief 2→ 3→Given the body text of a legal brief (with the existing Table of Authorities removed), extract all substantive citations
[Read] [REDACTED]/test-toa-skill.md
[result] 1→# Skill: Test TOA Extraction Against Ground Truth 2→ 3→Run the extract-toa skill on a brief and compare the output against a ground truth file. Uses subagents to keep the main context
[Read] [REDACTED]/error-analysis.md
[result] 1→# TOA Extraction Error Analysis 2→ 3→Brief: Ysleta del Sur Pueblo v. Texas, No. 20-493, Brief in Opposition 4→ 5→## Overall Accuracy 6→ 7→| Category | Draft | Grou
[Read] [REDACTED]/thread-links.md
[result] 1→# Bluesky Thread Links 2→ 3→Thread by @user-a.bsky.social about a Claude skill for extracting Table of Authorities from legal briefs. 4→ 5→Source: https://bsky.app/profile/u
13
eventual-frolicking-engelbart · 32 steps · 32 messages
[result] pdftotext not found Checking pdftotext and poppler...
[result] Exit code 1 Traceback (most recent call last): File "<string>", line 1, in <module> import fitz; print('pymupdf: OK') ^^^^^^^^^^^ ModuleNotFoundError: No module named 'fitz' Traceback (most
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[Bash] ls -la [REDACTED]/
[result] total 93912 drwxr-xr-x@ 12 j staff 384 Mar 2 18:34 . drwxr-xr-x@ 51 j staff 1632 Mar 2 18:21 .. drwxr-xr-x@ 3 j staff 96 Mar 2 18:23 .claude -rw-r--r--@ 1 j staff 42555812
[Bash] which pdftotext
[result] Exit code 1 pdftotext not found pdftotext not found
[Bash] python3 --version && which python3
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[Bash] python3 --version
[result] Python 3.14.2
[Read] [REDACTED]/extract-toa-skill.md
[result] 1→# Skill: Extract Table of Authorities from a Legal Brief 2→ 3→Given the body text of a legal brief (with the existing Table of Authorities removed), extract all substantive citations
[Read] [REDACTED]/error-analysis.md
[result] 1→# TOA Extraction Error Analysis 2→ 3→Brief: Ysleta del Sur Pueblo v. Texas, No. 20-493, Brief in Opposition 4→ 5→## Overall Accuracy 6→ 7→| Category | Draft | Grou
[Read] [REDACTED]/toa-draft.md
[result] 1→# Draft Table of Authorities 2→ 3→## Cases 4→ 5→| Authority | Pages | 6→|-----------|-------| 7→| Alabama-Coushatta Tribe of Tex. v. Texas, 540 U.S. 882 (2003) | 1
[Bash] python3 -c "import pdfplumber; print('pdfplumber: OK')" 2>&1
[Bash] which claude && claude --version 2>&1
[Read] [REDACTED]/.claude
[result] Exit code 1 Traceback (most recent call last): File "<string>", line 1, in <module> import pdfplumber; print('pdfplumber: OK') ^^^^^^^^^^^^^^^^^ ModuleNotFoundError: No module named 'pdfplum
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[Bash] ls -la [REDACTED]/
[result] total 8 drwxr-xr-x@ 3 j staff 96 Mar 2 18:23 . drwxr-xr-x@ 12 j staff 384 Mar 2 18:34 .. -rw-r--r--@ 1 j staff 128 Mar 2 18:23 settings.local.json
[Bash] which claude
[result] /opt/homebrew/bin/claude
[Bash] claude --version
[Read] [REDACTED]/settings.local.json
[Bash] pdftotext -v 2>&1 | head -5
14
eventual-frolicking-engelbart · 31 steps · 31 messages
[result] 2.1.62 (Claude Code)
[result] 1→{ 2→ "permissions": { 3→ "allow": [ 4→ "WebFetch(domain:bsky.app)", 5→ "WebFetch(domain:public.api.bsky.app)" 6→ ] 7→ } 8→} 9→
[result] (eval):1: command not found: pdftotext
[Bash] file [REDACTED]/brief-*.pdf
[Bash] python3 -c "import pypdf; print('pypdf: OK')" 2>&1
[Bash] python3 -c "import PyPDF2; print('PyPDF2: OK')" 2>&1
[Bash] ls -lh [REDACTED]/brief-*.pdf
[result] [REDACTED]/brief-cadc.pdf: PDF document, version 1.7 (zip deflate encoded) [REDACTED]/brief-tx-5th-dist.pdf: PDF document, version 1.6 (zip defla
[result] Exit code 1 Traceback (most recent call last): File "<string>", line 1, in <module> import pypdf; print('pypdf: OK') ^^^^^^^^^^^^ ModuleNotFoundError: No module named 'pypdf' Traceback (mos
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[Bash] ls -lh [REDACTED]/brief-*.pdf
[Read] [REDACTED]/test-toa-skill.md
[Read] [REDACTED]/thread-links.md
[result] -rw-r--r--@ 1 j staff 41M Mar 2 18:28 [REDACTED]/brief-cadc.pdf -rw-r--r--@ 1 j staff 3.5M Mar 2 18:28 [REDACTED]/brief-tx-5th-dist.pdf -rw-r--r--@ 1 j staf
[result] 1→# Skill: Test TOA Extraction Against Ground Truth 2→ 3→Run the extract-toa skill on a brief and compare the output against a ground truth file. Uses subagents to keep the main context
[result] 1→# Bluesky Thread Links 2→ 3→Thread by @user-a.bsky.social about a Claude skill for extracting Table of Authorities from legal briefs. 4→ 5→Source: https://bsky.app/profile/u
[Bash] python3 << 'EOF' import sys try: import poppler print("poppler-python: OK") except: pass try: import pd
[result] Installed Python packages: Package Version ------------ -------- certifi 2026.1.4 cffi 2.0.0 cryptography 46.0.5 pip 25.3 pycparser 3.0 wheel 0.46.3
[Bash] find [REDACTED]/lawthing -type d -name "pages*" 2>/dev/null
[result]
[Bash] find [REDACTED]/lawthing -type f \( -name "*.py" -o -name "*.sh" -o -name "Makefile" \) 2>/dev/null
[result]
[Bash] du -sh [REDACTED]/* 2>/dev/null | sort -h
[Bash] grep -r "page" [REDACTED]/*.md 2>/dev/null | head -20
[result] 4.0K [REDACTED]/test-toa-skill.md 4.0K [REDACTED]/thread-links.md 8.0K [REDACTED]/toa-draft.md 12K [REDACTED]/error-analysis.md 20
[result] [REDACTED]/error-analysis.md:| Indian Gaming Regulatory Act, 25 U.S.C. §§ 2701-21 | I | General IGRA citation on table-of-contents page; extraction started at page 1, skipping roman-n
[Bash] brew list | grep -i pdf
[Bash] python3 << 'EOF' # Try to see what's actually available for PDF reading import subprocess import sys pdf_modules = [
[result]
[result] PDF module availability: fitz: NOT INSTALLED pdf2image: NOT INSTALLED pypdf: NOT INSTALLED PyPDF2: NOT INSTALLED pdfplumber: NOT INSTALLED reportlab: NOT INSTALLED pdfrw: NOT INSTALLED
15
thinking · 38 lines · 4199 chars
Good, now I have a solid understanding. Let me think about the plan. Key findings:

1. No PDF extraction tools installed — need to install something (pdfplumber or pypdf)
2. Claude CLI v2.1.62 is available for running subagents
3. The skill itself is already mostly fixed (I did it earlier in this conversation)
4. We need ground truth for the other briefs
5. The SCOTUS briefs should contain their own TOAs we can extract
6. The 43MB CADC brief is very large

The approach:

1. Install PDF tools
2. Extract text from each brief, split into pages
3. Pull out existing TOAs from briefs that have them as ground truth
4. Build a test runner that orchestrates everything
5. Run the skill against all 4 briefs and compare results

Actually, I'm realizing the user probably meant we can use subagents directly within this Claude Code session rather than building a separate external script. The test runner would need to split PDFs into per-page text, invoke Claude to run the extraction skill, and compare against ground truth — but I should think about how to structure this within the current environment rather than as a standalone script.

I'm weighing two approaches: building a Python pipeline with pdfplumber and CLI calls, or doing everything within Claude Code using subagents via the Task tool. The second option aligns better with what the user suggested, since the skill is already designed for subagent execution. The tricky part is that the skill expects Claude Code's Task tool for subagents, whereas running it via CLI would use Claude's built-in subagent mechanism — I need to figure out which approach makes sense for this testing workflow. For a practical iteration loop, I should set up the infrastructure once (extracting PDF pages and ground truth), then repeatedly run the skill against all briefs, score the results, refine the skill, and test again.

On the line number question User B raised — the skill currently only tracks page numbers, and adding line numbers would require preserving that information in the page files and having extraction subagents track it throughout, which is a substantial modification. Though actually, for a table of authorities specifically, line numbers aren't typically included anyway — that was a separate need she had in mind. I think the line number concern is really about debugging and verification, not about the final TOA format itself. I could add line numbers to the intermediate extraction output for debugging purposes without changing what the final TOA looks like.

Now I'm mapping out the actual plan: install pdfplumber, build a Python script that extracts text from PDFs while preserving page numbers and line information, store the extracted pages in a directory structure, then extract ground truth TOAs from SCOTUS briefs to use as a comparison baseline. Beyond that, I need to address the remaining generalization issues like state court abbreviations and state reporters, plus adapt the skill to handle different brief types. Then I'll build a test runner that processes each brief, compares the extraction output to ground truth, and surfaces error patterns across multiple briefs so I can iterate on improvements.

Before diving in, let me verify the ground truth situation — SCOTUS briefs always include a table of authorities since it's required, and circuit and state briefs typically do as well. The 43MB CADC brief is likely a multi-document filing or contains extensive appendices, which could complicate extraction.

The real approach here is building a pipeline that moves from PDF to pages to extraction to scoring and back to iteration, with skill improvements driven by cross-brief error analysis rather than tuning for individual documents. The target is matching the success level the first skill achieved on its own document. Looking at the metrics from the first run — 83 authorities extracted against 99 in ground truth, with 53 correct, 21 false negatives, 1 false positive, and 7 page discrepancies — applying the identified fixes should reduce false negatives by roughly 15, bringing us to around 68 correct out of 99. But the goal isn't just improving on that one brief; it's achieving comparable accuracy across all the documents.
16
Now let me design the approach.
CLAUDE
17
Task
Plan Design test-driven skill iteration
453 lines returned
Good. Modern `uv` supports `uv run --with pdfplumber script.py` for zero-config dependency injection. Here is the complete architecture.

---

## Architecture: TOA Extraction Test Harness

### Summary of Constraints Discovered

1. **No PDF libraries installed.** `uv` (0.9.28) is available -- use `uv run --with pdfplumber` for zero-install script execution.
2. **`claude` CLI (2.1.62) cannot be nested** inside a Claude Code session. The test runner must be invoked from a terminal shell, not from within Claude Code.
3. **No `poppler` installed** -- cannot use `pdftotext` or Claude Code's built-in PDF reader. All PDF processing goes through `pdfplumber` via `uv run`.
4. **The 43MB CADC brief** will need page-count estimation first to determine chunking.
5. **The skill uses subagents** via Claude Code's Task tool, which means running the skill via `claude -p` with `--dangerously-skip-permissions` (for automated runs) will spawn those subagents automatically.

### File Layout

```
lawthing/
├── briefs/                          # symlinks or the PDFs themselves
│   ├── ysleta-petitioners.pdf
│   ├── ysleta-opposition.pdf
│   ├── cadc.pdf
│   └── tx-5th-dist.pdf
│
├── scripts/
│   ├── pdf_to_pages.py              # PDF -> per-page text files
│   ├── extract_ground_truth.py      # Extract TOA from brief front matter
│   ├── score.py                     # Compare extraction output vs ground truth
│   └── run_all.sh                   # Orchestrator: runs pipeline across all briefs
│
├── data/
│   ├── ysleta-petitioners/
│   │   ├── pages/                   # page-001.txt, page-002.txt, ...
│   │   ├── ground-truth.md          # Extracted TOA from front matter
│   │   ├── extraction-output.md     # Skill output
│   │   └── score.json               # Scoring results
│   ├── ysleta-opposition/
│   │   ├── pages/
│   │   ├── ground-truth.md
│   │   ├── extraction-output.md
│   │   └── score.json
│   ├── cadc/
│   │   └── ...
│   └── tx-5th-dist/
│       └── ...
│
├── results/
│   ├── cross-brief-report.md        # Aggregated cross-brief error analysis
│   └── iteration-log.jsonl          # Append-only log of each iteration's scores
│
├── skills/
│   └── extract-toa/
│       └── SKILL.md                 # The skill under test (copy of extract-toa-skill.md)
│
├── extract-toa-skill.md             # Original skill (existing)
├── error-analysis.md                # Existing
├── test-toa-skill.md                # Existing
└── toa-draft.md                     # Existing
```

### Component 1: `scripts/pdf_to_pages.py` -- PDF to Per-Page Text

**Purpose:** Extract text from a PDF, split into one file per page, with line numbers for debugging.

**Key design decisions:**
- Use `pdfplumber` (best text extraction for legal documents -- preserves columns, footnotes, headers).
- Output files named `page-001.txt`, `page-002.txt`, etc. First line of each file is the page number (matching the skill's expectations from section 1 of `extract-toa-skill.md`).
- Roman-numeral front ma
… (20660 chars total)
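
A minimal sketch of the component described above, under the stated design (pdfplumber via uv, page number on line 1, tab-separated line numbers; display-number and roman-numeral detection are elided):

```python
# Sketch of scripts/pdf_to_pages.py as designed above. Assumes pdfplumber;
# run via: uv run --with pdfplumber scripts/pdf_to_pages.py <brief.pdf> <out_dir>
# Display-number detection (roman front matter, standalone page numbers) is
# omitted -- this version just uses the PDF's physical page index.
import sys
from pathlib import Path
import pdfplumber

def pdf_to_pages(pdf_path: str, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with pdfplumber.open(pdf_path) as pdf:
        for n, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""        # extract_text() may return None
            numbered = [f"{i}\t{line}" for i, line in enumerate(text.splitlines(), 1)]
            body = f"{n}\n" + "\n".join(numbered)   # page number on line 1
            (out / f"page-{n:03d}.txt").write_text(body)

if __name__ == "__main__":
    pdf_to_pages(sys.argv[1], sys.argv[2])
```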
18
eventual-frolicking-engelbart · 49 steps · 49 messages
I'm designing a system to iteratively improve a Claude Code skill for extracting Table of Authorities (TOA) from legal briefs. The skill needs to generalize across different brief types (SCOTUS, circuit court, state court). Here's the context: **Current state:** - We have 4 PDF briefs (SCOTUS petitioners, SCOTUS opposition, D.C. Circuit, Texas state) - We have an extraction skill (extract-toa-skill.md) that was developed against 1 brief, tested against 1 other - Error analysis shows 53/99 correct on the opposition brief, with 21 false negatives - Several bugs have been fixed in the skill already - Claude CLI v2.1.62 is available - No PDF tools installed yet (need pdfplumber or similar) - The skill uses a multi-phase subagent architecture **The core problem (from a legal professional's criticism):** The skill was overtrained on one brief. Like a regex that handles known cases but fails on unknown ones. We need to test against ALL briefs and iterate. **Ground truth problem:** - SCOTUS briefs contain their own TOAs (required by court rules) — we can extract these as ground truth - Circuit and state briefs usually also have TOAs - So we can extract ground truth from each brief's own front matter **Key constraints:** - The 43MB CADC brief is very large — may need special handling - Each extraction run involves multiple nested Claude subagent calls (expensive) - We want to iterate rapidly, so the test loop needs to be efficient - Line numbers should be in the extraction output for debugging, even though the final TOA only uses page numbers **Design question:** What's the most efficient architecture for: 1. PDF → page extraction pipeline (Python script) 2. Ground truth extraction from brief front matter 3. Test runner that invokes the skill against each brief 4. Scoring/comparison system 5. Error analysis that identifies cross-brief patterns (not just per-brief bugs) Consider: - Should we use the `claude` CLI to run the skill, or orchestrate from within Claude Code using Task agents? - How do we handle the nested subagent architecture efficiently? - Should we flatten the skill for testing (single-pass instead of multi-phase) to iterate faster, then validate the multi-phase version at the end? - How do we extract ground truth TOAs from PDF front matter automatically? Give me a concrete architecture with scripts and file layout.
[Bash] ls -la [REDACTED]/
[result] total 93912 drwxr-xr-x@ 12 j staff 384 Mar 2 18:34 . drwxr-xr-x@ 51 j staff 1632 Mar 2 18:21 .. drwxr-xr-x@ 3 j staff 96 Mar 2 18:23 .claude -rw-r--r--@ 1 j staff 42555812
[Glob] **/*.md
[Glob] **/*.py
[result] [REDACTED]/error-analysis.md [REDACTED]/toa-draft.md [REDACTED]/test-toa-skill.md [REDACTED]/thread-links.md [REDACTED]/lawt
[result] No files found
[Read] [REDACTED]/extract-toa-skill.md
[result] 1→# Skill: Extract Table of Authorities from a Legal Brief 2→ 3→Given the body text of a legal brief (with the existing Table of Authorities removed), extract all substantive citations
[Read] [REDACTED]/error-analysis.md
[result] 1→# TOA Extraction Error Analysis 2→ 3→Brief: Ysleta del Sur Pueblo v. Texas, No. 20-493, Brief in Opposition 4→ 5→## Overall Accuracy 6→ 7→| Category | Draft | Grou
[Read] [REDACTED]/test-toa-skill.md
[Read] [REDACTED]/toa-draft.md
[result] 1→# Skill: Test TOA Extraction Against Ground Truth 2→ 3→Run the extract-toa skill on a brief and compare the output against a ground truth file. Uses subagents to keep the main context
[result] 1→# Draft Table of Authorities 2→ 3→## Cases 4→ 5→| Authority | Pages | 6→|-----------|-------| 7→| Alabama-Coushatta Tribe of Tex. v. Texas, 540 U.S. 882 (2003) | 1
[Read] [REDACTED]/thread-links.md
[result] 1→# Bluesky Thread Links 2→ 3→Thread by @user-a.bsky.social about a Claude skill for extracting Table of Authorities from legal briefs. 4→ 5→Source: https://bsky.app/profile/u
[Bash] ls -laR [REDACTED]/
[result] total 8 drwxr-xr-x@ 3 j staff 96 Mar 2 18:23 . drwxr-xr-x@ 12 j staff 384 Mar 2 18:34 .. -rw-r--r--@ 1 j staff 128 Mar 2 18:23 settings.local.json
[Bash] which claude 2>/dev/null && claude --version 2>/dev/null || echo "claude CLI not found in PATH"
[Bash] which python3 && python3 --version && python3 -c "import pdfplumber" 2>&1 || echo "pdfplumber not installed"
[result] /opt/homebrew/bin/claude 2.1.62 (Claude Code)
[result] /opt/homebrew/bin/python3 Python 3.14.2 Traceback (most recent call last): File "<string>", line 1, in <module> import pdfplumber ModuleNotFoundError: No module named 'pdfplumber' pdfplumber not
[Read] [REDACTED]/settings.local.json
[result] 1→{ 2→ "permissions": { 3→ "allow": [ 4→ "WebFetch(domain:bsky.app)", 5→ "WebFetch(domain:public.api.bsky.app)" 6→ ] 7→ } 8→} 9→
[Bash] claude --help 2>&1 | head -80
[result] Usage: claude [options] [command] [prompt] Claude Code - starts an interactive session by default, use -p/--print for non-interactive output Arguments: prompt
[Bash] python3 -c "import pymupdf" 2>&1; python3 -c "import fitz" 2>&1; python3 -c "import PyPDF2" 2>&1; python3 -c "import pdf
[result] Traceback (most recent call last): File "<string>", line 1, in <module> import pymupdf ModuleNotFoundError: No module named 'pymupdf' Traceback (most recent call last): File "<string>", line 1
[Bash] ls -lh [REDACTED]/*.pdf
[Bash] claude -p --help 2>&1 | head -20
[result] -rw-r--r--@ 1 j staff 41M Mar 2 18:28 [REDACTED]/brief-cadc.pdf -rw-r--r--@ 1 j staff 3.5M Mar 2 18:28 [REDACTED]/brief-tx-5th-dist.pdf -rw-r--r--@ 1 j staf
[Bash] pip3 list 2>/dev/null | head -30; echo "---"; python3 -c "import subprocess; print('subprocess ok')" 2>&1
[result] Usage: claude [options] [command] [prompt] Claude Code - starts an interactive session by default, use -p/--print for non-interactive output Arguments: prompt
[result] Package Version ------------ -------- certifi 2026.1.4 cffi 2.0.0 cryptography 46.0.5 pip 25.3 pycparser 3.0 wheel 0.46.3 --- subprocess ok
[Bash] claude -p --output-format json "echo test" 2>&1 | head -20
[Bash] # Check if there's a way to pass files to claude CLI claude -p --help 2>&1 | grep -i "file\|read\|input" | head -10
[result] Error: Claude Code cannot be launched inside another Claude Code session. Nested sessions share runtime resources and will crash all active sessions. To bypass this check, unset the CLAUDECODE environ
[result] -d, --debug [filter] Enable debug mode with optional category filtering (e.g., "api,hooks" or "!1p,!file") --debug-file <path> Write debu
[Bash] # Check what's available for PDF handling without pip install which pdftotext 2>/dev/null && echo "pdftotext available"
[Bash] # Check if uv is available for lightweight venv creation which uv 2>/dev/null && echo "uv available" || echo "uv not ava
[result] pdftotext not found pdftotext not available pdftk not found pdftk not available mutool not found mutool not available qpdf not found qpdf not available /usr/bin/textutil textutil available /usr/bin/md
[result] /opt/homebrew/bin/uv uv available pipx not found pipx not available
[Read] [REDACTED]/brief-ysleta-petitioners.pdf
[result] pdftoppm is not installed. Install poppler-utils (e.g. `brew install poppler` or `apt-get install poppler-utils`) to enable PDF page rendering.
[Read] [REDACTED]/brief-ysleta-opposition.pdf
[result] <tool_use_error>Sibling tool call errored</tool_use_error>
[Bash] uv --version 2>&1
[result] uv 0.9.28 (Homebrew 2026-01-29)
19
thinking · 18 lines · 1787 chars
Let me synthesize everything I've learned and write the plan, thinking about what's realistic and what the key decisions are. Key findings:

1. No PDF tools installed, but `uv` is available for zero-install script execution
2. Claude CLI v2.1.62 available
3. 4 PDF briefs, all likely containing their own TOAs as ground truth
4. The skill already has bug fixes applied
5. Need a pipeline: PDF → pages → extraction → scoring → iteration

The plan agent's design is solid. Let me refine it and write the plan file.

One concern: the `claude -p` approach for running the skill. I need to think through the execution environment carefully. The skill expects Claude Code's Task tool for subagents, but when I run it via `claude -p`, I need to make sure the agent knows to use Task and that `--allowedTools` includes it. There's also the question of whether calling `claude -p` from within Claude Code itself will work—the plan agent mentioned a potential nested-session error. The `CLAUDECODE` environment variable might interfere, but unsetting it before the call should resolve that issue.

Actually, I'm reconsidering the whole approach. The user mentioned using subagents to loop over testing, which might mean leveraging the Task tool directly from this session rather than building separate shell scripts. But I think the most practical path is to build the Python scripts for the core logic, then orchestrate everything through Bash — running the extraction with `claude -p` (after unsetting `CLAUDECODE`) and managing the iteration loop from Claude Code itself. That way the user stays in one place while I handle the execution details. I should also keep in mind what User B actually wanted from the skill improvements, starting with generalization across different brief types.
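
Given the nested-session error seen earlier, a sketch of how orchestration code could invoke the CLI with the guard variable stripped (the env-var name comes from that error message; the prompt string is illustrative):

```python
# Sketch: call the claude CLI from orchestration code with CLAUDECODE removed,
# to bypass the nested-session guard (per the error message seen earlier).
# The prompt text is illustrative, not the final run_brief.sh wording.
import os
import subprocess

env = {k: v for k, v in os.environ.items() if k != "CLAUDECODE"}
subprocess.run(
    ["claude", "-p",
     "Read skills/extract-toa/SKILL.md and extract the TOA from data/ysleta-opposition/pages/."],
    env=env,
    check=True,
)
```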
20
Write
[REDACTED]/eventual-frolicking-engelbart.md
1 lines returned
File created successfully at: [REDACTED]/eventual-frolicking-engelbart.md
21
ExitPlanMode
allowedPrompts:
- Bash: install Python packages via uv or pip
- Bash: run Python scripts for PDF processing
- Bash: run claude CLI for extraction and ground truth parsing
- Bash: create directories for project structure
- Bash: run test/scoring scripts

plan:

# Plan: Cross-Brief TOA Extraction Test Harness + Skill Iteration

## Context

User A built a Claude Code skill for extracting Table of Authorities from legal briefs. User B (a paralegal) correctly identified the core problem: the skill was developed against one brief and tested against one other — it's overtrained. She provided 4 briefs spanning SCOTUS, D.C. Circuit, and Texas state court, and wanted all tested. We need a pipeline that runs the skill against all briefs, scores it against ground truth, and lets us iterate until it generalizes.

The skill has already been partially fixed (parenthetical inclusion, filing exclusion narrowing, statute granularity, roman-numeral pages). Now we need to validate those fixes and find remaining issues across all brief types.

## File Layout

```
lawthing/
├── scripts/
│   ├── pdf_to_pages.py          # PDF → per-page text files
│   ├── extract_ground_truth.py  # Extract TOA section from PDF front matter
│   ├── score.py                 # Compare extraction vs ground truth
│   └── run_brief.sh             # Run extraction on one brief via claude CLI
├── data/
│   ├── ysleta-petitioners/
│   │   ├── pages/               # page-i.txt, page-001.txt, ...
│   │   ├── ground-truth.md      # TOA from the brief itself
│   │   └── extraction-output.md # Skill output
│   ├── ysleta-opposition/       # (same structure)
│   ├── cadc/
│   └── tx-5th-dist/
├── skills/
│   └── extract-toa/
│       └── SKILL.md             # Working copy of the skill (iterated on)
├── extract-toa-skill.md         # (existing, reference copy)
├── brief-*.pdf                  # (existing, 4 PDFs)
└── ...                          # (existing files)
```

## Steps

### 1. Install nothing — use `uv run --with`

`uv` (0.9.28) is already installed. All Python scripts run via `uv run --with pdfplumber` / `uv run --with rapidfuzz`. No venv, no requirements.txt.

### 2. Build `scripts/pdf_to_pages.py`

Extracts text from a PDF, one file per page. Key behaviors:
- Detect page numbers (standalone numbers on their own line) — use as display numbers
- Roman-numeral front matter pages: `page-i.txt`, `page-ii.txt`, etc.
- Each file starts with the page number on line 1 (matching skill expectations)
- Preserve line numbers within each page (tab-separated second column) for debugging
- `--page-range` flag for partial extraction during dev
- Handle the 43MB CADC brief (pdfplumber loads lazily, so memory is fine)

### 3. Build `scripts/extract_ground_truth.py`

Two-stage ground truth extraction:
- **Stage 1 (Python/pdfplumber):** Find pages between "TABLE OF AUTHORITIES" heading and next major section. Output raw text to `toa-raw.txt`.
- **Stage 2 (Claude CLI):** Parse raw TOA into structured markdown tables (same format as `toa-draft.md`). Claude handles the formatting variation across courts (dotted leaders, continuation pages, column layouts).

Output format — markdown tables by category with `Authority | Pages` columns.

### 4. Build `scripts/score.py`

Compare extraction output vs ground truth. Key design:
- Parse both into `(authority_text, page_set)` tuples per category
- Match using key-component extraction: `(volume, reporter, start_page)` for cases, `(title, code, section)` for statutes — more reliable than fuzzy string matching
- Fall back to rapidfuzz (threshold 85%) for authorities that don't match structurally
- Score: correct / false_negative / false_positive / page_discrepancy / ambiguous (passim vs explicit)
- Output `score.json` per brief + optional `--aggregate` mode for cross-brief report
- Classify false negative causes heuristically (parenthetical-only, roman-numeral page, standalone section, filing exclusion, etc.)

### 5. Build `scripts/run_brief.sh`

Runs extraction on one brief via `claude -p`:
```bash
unset CLAUDECODE  # avoid nested-session error
claude -p --dangerously-skip-permissions --model sonnet \
  --max-budget-usd 5.00 \
  "Read the skill at skills/extract-toa/SKILL.md. Extract TOA from pages in data/<brief>/pages/. Output markdown tables."
```

Uses sonnet for iteration speed, opus for final validation. `--max-budget-usd` prevents runaway subagent chains.

### 6. Copy skill to working location

Copy `extract-toa-skill.md` → `skills/extract-toa/SKILL.md`. All iteration edits target the working copy.

### 7. Run pipeline: pages → ground truth → extraction → scoring

Order of briefs (smallest/best-understood first):
1. `ysleta-opposition` — has existing error analysis, known ground truth
2. `ysleta-petitioners` — same case, SCOTUS format
3. `tx-5th-dist` — state court, different conventions
4. `cadc` — largest (43MB), run last

### 8. Iterate on skill

Loop:
1. Run one brief, check `score.json`
2. Identify error pattern
3. Edit `skills/extract-toa/SKILL.md`
4. Re-run, confirm improvement
5. Run all briefs, check cross-brief report
6. Repeat until accuracy stabilizes across all 4

### 9. Generalization improvements to the skill

Beyond the bugs already fixed, the skill needs:
- **State court record abbreviations:** Add `CR`, `ER`, `TR` to the filing exclusion examples
- **State reporters:** Add `S.W.`, `S.W.2d`, `S.W.3d`, `N.E.`, `A.2d`, etc. to the reporter list
- **Flexible categories:** Don't assume all 4 standard categories exist — state briefs may lack Legislative Materials
- **Line numbers in extraction output:** Add to intermediate extraction results (for debugging), not to final TOA

## Verification

- Each brief's `score.json` shows accuracy metrics
- Cross-brief report identifies patterns that span briefs
- Target: match or exceed the original skill's accuracy on the opposition brief (~83 authorities found) while achieving comparable accuracy on the other 3 briefs
- Final validation: re-run all 4 briefs with opus model, compare to sonnet iteration runs
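
To make step 4 concrete, a sketch of the matching core (the key-component idea and 85% threshold come from the plan; the `CASE_KEY` regex is a rough assumption and misses some reporters):

```python
# Sketch of the score.py matching core from step 4: structural keys first,
# rapidfuzz fallback at the plan's 85% threshold. The CASE_KEY regex is a
# rough assumption (it misses reporters like "F. Supp. 2d").
# Run via: uv run --with rapidfuzz score_sketch.py
import re
from rapidfuzz import fuzz

CASE_KEY = re.compile(r"\b(\d+)\s+([A-Z][A-Za-z0-9.]*(?:\s[A-Z][A-Za-z0-9.]*)*)\s+(\d+)\b")

def case_key(citation: str) -> tuple[str, str, str] | None:
    """(volume, reporter, first page), e.g. '540 U.S. 882' -> ('540', 'U.S.', '882')."""
    m = CASE_KEY.search(citation)
    return (m.group(1), m.group(2), m.group(3)) if m else None

def match_cases(extracted: list[str], truth: list[str], threshold: float = 85.0):
    """Pair extracted case citations with ground-truth entries."""
    by_key = {case_key(t): t for t in truth if case_key(t)}
    pairs, unmatched = [], []
    for e in extracted:
        k = case_key(e)
        if k and k in by_key:                          # structural match
            pairs.append((e, by_key[k]))
            continue
        best = max(truth, key=lambda t: fuzz.ratio(e, t), default=None)
        if best and fuzz.ratio(e, best) >= threshold:  # fuzzy fallback
            pairs.append((e, best))
        else:
            unmatched.append(e)                        # candidate false positive
    return pairs, unmatched
```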