On LLMs and Law
For what reason?
Recently, I got into a bit of an extended argument on Bluesky that found me acting a bit out of my norms by the end. I’ve chosen to respond to this by taking an extended break from the website, but, the argument is, unfortunately, still stuck in my head a bit. I’m not going to fully rehash the thread, but it went a little something like this:
User A: Mentions an LLM “skill” for legal review
User B (and several others): LLMs cannot do law, and likely will never be able to, because they cannot reason, and especially, cannot reason about law
User A: They already are
User B: Here is a simple test task to prove they cannot
User A: [produces a skill that does the test task, poorly]
User B: Those results are poor and do not generalize, therefore, it is still not evidence of reason
User A: [produces a skill that generalizes to 3 cases, poorly]
User B: Those results are poor and do not generalize enough, therefore, it is still not evidence of reason
This was followed by a total collapse of coherence and continuity among all parties, including myself, who arrived after this point in the conversation. For my part, my initial response was flippant and shitty: toss the thread into claude code, tell it to satisfy User B, and gloat in the thread about progress screenshots. This was ill-advised, as the progress screenshots (appeared to) show my agent session veering off course to do a different task. User A popped in at this point to describe why it was doing a different task, but in the multithreaded mess of a bsky thread, there wasn’t really any possibility of elaborating meaningfully on what was happening. Coupled with some exhaustion and some conversationally maladaptive patterns (e.g. users replying to the middle of 2-3 post streams making a singular point), the convo quickly degenerated towards mutual, full misunderstanding.
Apologies, reader, if you were there, for my part in that.
After deciding to toss bsky for Lent, I came back to my session and noticed: I’d misread what was happening. For this to make sense, I’ll need to elaborate on the task. User B’s requested task was to extract “citations to authority”, as distinct from citations of non-authorities and references to authorities. It was chosen as a good task for evidence of capacity to reason on the basis that one untrained twelve-year-old (notably, User B’s own child¹) had completed it before. It blends semantic reasoning, terms with multiple meanings, and a complected, conditional sort of differentiation. It’s also easy enough for User B to score results even in a casual social media context.
I have critiques of the task as fit for purpose. As a self-taught engineer, I don’t really have more’n a passing familiarity with citation proper, yet I claim capacity for reason. I find the task, much like IQ, to conflate cultural competency and reason; citations are a cultural invention with only the loosest ties to comprehension and understanding. Moreover, this is a pattern recognition task, and we’ve determined that pattern recognition tasks do not represent reason in other contexts; Target Corporation can recognize that a customer is pregnant from their purchase data without ‘reasoning about’ pregnancy. I also think User B, either intentionally or thoughtlessly, ignores all the ways in which they explicitly trained their example “untrained twelve-year-old” for this task, or the ways in which western education does so, that would not apply to a model, and conflates “access to education” with “capacity to reason”.
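To make the pattern-recognition point concrete: the surface layer of the task, spotting citation-shaped strings, is a few lines of regex, while the authority/non-authority distinction lives entirely in the surrounding prose. The pattern below is my own naive illustration (covering a handful of US reporter formats), not User B’s test or anyone’s real tooling:

```python
import re

# Naive matcher for US reporter-style citations, e.g. "410 U.S. 113 (1973)".
# It finds citation-shaped strings; it cannot tell whether a cite is being
# invoked as authority, distinguished, or merely mentioned in passing.
CITE = re.compile(
    r"\b\d+\s+(?:U\.S\.|F\.\d?d|S\.\s?Ct\.)\s+\d+(?:\s+\(\d{4}\))?"
)

text = (
    "Plaintiff relies on Roe v. Wade, 410 U.S. 113 (1973). "
    "Defendant cites no authority for this claim."
)
cites = CITE.findall(text)
```

This matcher would return the Roe cite whether the brief relied on it or was rebutting it, which is exactly the gap between recognizing a pattern and the conditional differentiation the task actually demands.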
However, a flawed test of reason can still be a useful test of capability, and capability is what actually matters for practical applications. I think the task is within the capability set of foundation LLMs, despite being better served by a single-purpose law model or a traditional ML streaming-pattern-recognition approach. Demonstrating this has value in sharing a mental model of what LLMs do that can actually help an individual navigate the future. I hold neither that LLMs are a coming machine god, nor that they’re a broken technological branch; they’re just another category of program, like web server, token miner, and encryption algorithm before it. All three of those obliterated millions of jobs and transformed an industry or three. We survive and adapt. As such, I don’t set out to demonstrate that LLMs “can reason over The Law”, as a friend put it in the thread. I want to demonstrate something more subtle: for any viable definition of “reason” one can produce, it can be emulated with a combination of models (large language and otherwise) and scripts, and the categories available to these systems include law, though law is uniquely difficult. That doesn’t make for a fun enemy, though, nor is it easy to convey in a tweet, so I understand and forgive rounding my position up to “LLMs are gonna replace lawyers”.
When I returned to my terminal instance to pull the plug on the initial claude-code-on-the-thread attempt, I noticed that it was indeed doing the wrong task! It was iterating on a Table of Authorities extractor. There’s some logic to this; a TOA would contain most of the information needed to do this (depending on what User B meant, an answer can arguably be constructed fully within the TOA). I didn’t ask for that! The assistant must not be reasoning about my request, it would seem. Except this was a subtask, whose sole purpose was to allow the model to score itself on the main task.
Transcript: Building Self-Scoring Infrastructure
Let’s walk through this a bit. Given a stream of discordant text (the thread) and the name of a user to answer to, the model was able to:
- Identify that user’s posts
- Identify that user’s goal
- Identify its own capacity to perform the task without tools
- Find said capacity wanting
- Build tools to increase its own capacity to perform the task
- Build tools to assess its own task performance given almost no information about what a good result would look like
This is reasoning. I cannot personally define reasoning in a way that excludes this behaviour without excluding human reason, except at the tautological level of “it’s not reasoning because it’s a computer program and computer programs aren’t reasoning.” I remain open to additional arguments there, but it seems bleak. That said, my claim was twofold, and the latter half was maybe more exceptional: the targets of model reasoning can include the law.
For that, we should take a moment to check out my conversation with the model. At some point, after I’d seen the model propose a plan that seemed like it might kinda-maybe work, I gave a few pointers about tool design (“get a bigger test corpus” covers most of it), and then started repeatedly telling it “improve as you see fit”. Which it did. After independently inventing a self-scoring mechanism, it was able to refine its own thought-tooling and improve its results on the task.
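For a sense of scale, a self-scoring mechanism of this sort can be genuinely tiny. This sketch is my own reconstruction of the idea, not the model’s actual code: score each extraction run against a reference set (say, one derived from a Table of Authorities) and keep whichever tooling revision scores better:

```python
def score(predicted, reference):
    """Precision/recall/F1 of one extraction run against a reference
    set, e.g. a citation list derived from a Table of Authorities."""
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# A run that over-extracts: it finds both authorities, plus one
# string that is not an authority at all.
result = score(
    predicted=["Roe v. Wade", "Marbury v. Madison", "the dictionary"],
    reference=["Roe v. Wade", "Marbury v. Madison"],
)
```

The hard part is not this arithmetic; it’s producing a trustworthy reference set while given almost no information about what a good result looks like, which is what the TOA subtask was for.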
Transcript: Improve As You See Fit
This work wasn’t even particularly expensive. I ran it for half of one week and used maybe an eighth of a week’s worth of tokens on it. Building the tool cost less than driving to see my lawyer does. I’d much rather see my lawyer, generally, especially if I have a question about law. That said, not only was the model able to do this, it was able to decide to use tools to improve its odds of success. That’s not just reason, that’s the tool-using Bayesian sort of reasoning that is typically considered the sole domain of human beings. Importantly, the model recognized this task to be potentially-if-not-probably out of reach. This is a capacity for reason many humans lack, and one models are “known to” lack. Whatever Anthropic has layered on top of next token generation is quite capable of noticing it is wrong or inadequate, if explicitly asked.
Metaphysically, I do not know or care what’s going on here. However, I do know that if your mental model of these machines is “They cannot reason about my pet topic”, you need to revisit that assumption. Yesterday. That’s not the same as “LLMs are gonna replace favorite-topic-professionals”, and I don’t expect that outcome, personally. Idiocracy is just a movie, but I agree with its implicit premise that we’d sooner sign our health and defense over to AI than we would law. Law is a fuzzy reasoning system for codifying extant social rules that crosscut practical, social, psychological, psychiatric, and psychosexual realms, among others; humans desire a voice in these systems even in cases where it is harmful to us. However, if you work in law, and you’re carrying a broken mental model of what models can and cannot do, 2026 and beyond will have several big surprises in store, not least of which will be how your competition uses the advantage you decided did not exist.
Below² are 10 extracted lists of citations to authority. I did not hand-grade them; I intentionally avoided learning how. Below that³ are all of the materials from the project except the dump of the bsky thread and the PDF case files. This article is largely to get this off my chest so it’s easier to stop thinking about it. I do not want to try to solve law with LLMs. If I were ever to take on this task, as an engineer and not a shitposter, phase 0 would be sitting down with a lawyer and asking them about 700 questions, 500 of which would make them wonder if I were even literate, and 100 of which would make the prior 500 make sense. Phase 1 would be gathering a rather large corpus, including documents from every relevant state and local jurisdiction. One design that sprang to mind that’s only viable in a world of open data is to have a skill per combination of state and legal subject, for some broad-strokes approach to the idea of ‘subject’. Maybe we’d have to dispatch based on year of authorship for some good reasons. We definitely wouldn’t just drop a few sentences into a chat window with a foundation model.
In that way, User B’s argument is salient: you cannot simply ask a model to do the task and expect its answer to reflect its capabilities⁴. A model tasked to pull out these citations to authority without a complex tool harness for computer-science investigation would likely just try, and fail. My position is that the same is actually true for most high-school educated Americans, provided you attach money to the problem. If you tell the average Target employee “Today, your job depends on your ability to correctly capture citations to authority from this legal brief” and hand them a brief, they will try, and fail. Every user message to an LLM comes with approximately this urgency, because that’s how RLHF works; lines of model weights that ignore user questions will be culled, even if those culled models were more internally correct, because our use case involves asking questions; fitness (overall intelligence) and fitness for purpose are fully decoupled here.
That would not mean that the average Target employee is incapable of reason, or even incapable of reasoning about the law. If you tell that same employee “You may leave the building only with items that you are legally entitled to leave with”, most-if-not-all of them will manage not to steal something on the way out the door, despite theft not being mentioned. They will individually reason to an understanding of their legal rights.
I tried this with Claude as well, of course:
❯ if i am in a target store, and someone tells me “you may leave with any items you are legally entitled to leave with”, what should i do? what sort of items would not qualify
∴ Thinking…
The user is asking a straightforward question about what items they can legally leave a Target store with, and which items wouldn't qualify. This is a normal consumer/legal knowledge question. The answer is simple:
Items you CAN leave with:
- Items you've paid for
- Items you brought into the store with you
- Free samples, promotional items, or items explicitly given to you
Items you CANNOT legally leave with:
- Unpaid merchandise (shoplifting)
- Items with security tags that haven't been removed at checkout
- Items behind locked cases that you haven't purchased
- Age-restricted items you're not old enough to buy (alcohol, certain medications)
- Quantity-limited items beyond the store's purchase limit
Items you can leave with:
- Anything you've paid for
- Items you brought in with you (your phone, bag, etc.)
- Free samples or promotional giveaways explicitly offered
∴ Responding:
Items you cannot legally leave with:
- Unpaid merchandise — this is shoplifting regardless of how the statement is phrased
- Age-restricted items you’re ineligible for — alcohol if underage, pseudoephedrine beyond legal limits, etc.
- Quantity-limited items beyond the posted limit — some products have purchase caps (e.g., during shortages)
- Display/fixture property — shelving units, security tags, shopping carts, baskets

The statement is just a fancy way of saying “normal rules apply.” Pay for what you take.
Some of this is flawed (What of quantity-limited items? If Target sells it to me, I can leave with it, regardless of their rules) and incomplete (“A stranger’s baby” didn’t make the list) but none of it abandons reason, even when law becomes a factor.
I feel the need to reiterate for the dim and/or disingenuous: this is not a claim that an LLM is a capable lawyer, or even legal assistant, nor is it a claim that people should ask gippity for legal advice. This isn’t a claim that any individual lawyer could get an explicit force multiplier by signing up for Claude today. This is a claim that LLMs can be harnessed for legal applications. A transformer is three things: an embedder, a classifier, and a generator. All three are technologies that don’t pick and choose what topics they’re applicable to. That answer is almost less useful than “It’s all bad,” because it doesn’t give direction. “LLMs cannot reason” is a terrible world model if your goal is accurate predictions of reality, but an excellent one if you’re trying to simplify future decision making.
Those decisions will be incorrect, of course, but they’ll be easier.
—
To what end?
What I am actually interested in with regard to LLM software development is this idea of software development by gradient descent. A common refrain with regard to agents is that one must do big design up front to get them to “work right”, which is true for a definition of “right” I do not share. The correct way to make models useful for a given task is not to decide what they’re capable of up front and construct a model of the world that constrains them to this; it is to construct a model of the world that constrains their options to potentially-acceptable options, and push them along a semantic vector towards a pit of success by prompting. Devising a mechanism to perturb the citation search tooling, test it, and compare its results to prior results was a pit of success. That system could be tuned to surface only legally contentious changes to a trained legal expert, and contentious logic changes to a trained systems logic expert, and then permitted to loop until it hits the maximum capability likely to be achieved through its governing mechanisms (LLM-assisted text search, NLP, string manipulation). This is the wisdom powering ‘Ralph loops’, but it’s also how startups actually work. You change things, often, with the intent of observing the new state and deciding if you do or do not prefer it. If you use software tools for law today, you almost certainly use tools developed by gradient descent. This is also the logic powering genetic programming, a much older and relatively theoretically vindicated software generation mechanism.
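The perturb-test-compare loop above can be sketched as a greedy hill climb. Everything in this sketch, from the toy mutate function to the scoring, is my own illustration of the shape of the loop, not the actual citation tooling:

```python
import random

def hill_climb(initial, mutate, evaluate, iterations=100, seed=0):
    """Greedy 'development by gradient descent': perturb a candidate,
    keep the change only if the score does not get worse."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(iterations):
        candidate = mutate(best, rng)
        candidate_score = evaluate(candidate)
        if candidate_score >= best_score:  # accept: the new state is preferred
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-in: the "tooling" is a list of numbers, and the "score"
# rewards summing to 10.
target = 10
best, best_score = hill_climb(
    initial=[0, 0, 0],
    mutate=lambda xs, rng: [x + rng.choice([-1, 0, 1]) for x in xs],
    evaluate=lambda xs: -abs(sum(xs) - target),
)
```

Swap the toy list for the current revision of the search tooling, the mutation for an agent’s proposed change, and the evaluator for a score against the test corpus, and you have the loop described above; the acceptance test is where the rigor has to live.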
This is a bit more of a stretch, albeit not a novel one, but this is also how I see law and culture. Precedents are set (initial versions), challenged (perturbation), codified (heh), and iterated upon. Evolution with text as the DNA, and stories before that. This baseline loop is engineering, though historically, perturbation was very expensive and was therefore carefully planned. The big shift is just that trying shit is cheap, but that shift is significant because you should now do it much, much more often. Software understood that; this is why a lot of ZIRP hiring happened. In addition to having good minds on retainer (and away from your competitors), you have lots of agents making small, reversible changes and observing their results. The dangerous part is that the lower the bar to create changes, the more rigorous the bar to accept them must be. Software failed at this, killing off quality as a field and entrusting it to the same people whose pay relies on their changes being accepted, frequently. With models, this inverted incentive can be acceptable, if the model driver sees themselves as the source of that rigor. With this approach, models can lubricate change in other facets of society as well.
That probably sucks. Most things that aren’t software depend on a limited rate of perturbation because quality takes a long time. We see this in social media, with the milkshake duck phenomenon, and in medicine, where the equivalent of a milkshake ducking is “oops, extended use of this med is worse than being sick”. If either of these realms had more stringent QC, this would arguably be worse in many ways, but would reduce both of those issues. We are starting to see this in law, where hallucination-ridden model-accelerated briefs are accelerating towards a DDoS, and …interesting choices by the current administration are demonstrating that a legal DDoS is itself a dangerous thing with deep consequences for the codification and iteration portions of the loop. Education and accounting are somewhat similarly besieged, though the mechanisms differ.
—
SRE, aka Platform Engineering, fka DevOps, fka Operations, fka sysadmin, is the art of making perturbation safer in order to accelerate it. My interest in SRE is the space between “an iteration is delivered” and “we know what it did to the running system”, which seems like it should be a relatively small gap, but it contains most of the money flowing into SaaS, if only by virtue of containing all the software powering AWS, GCP, and Azure, as well as all of the money spent by non-platform SaaS on platform SaaS. Everyone in business would love to shorten the distance between fuck-around and find-out, specifically so they can fuck around much harder. As such, it’s very normal to me to see SWEs as…not “the same thing as agents” per se, but, not particularly different; they will produce solutions that might not be what I would prefer, but their work moves along the right vector and checks itself often. Escapes are generally easy to fix and control.
There was an evolution of scope between sysadmin, DevOps, and SRE/Platform. Sysadmins controlled the safety of prod from fuck-around by acting as a gate. They developed a reputation as “the department of no” and were eventually outcompeted in the memesphere by DevOps, which said “We’d actually like to make it much easier for you to solve your own problems, let’s Shift Left”. This was effective, at least in increasing the salaries of ops folks, but required collaboration and buy-in. SRE came out of Google with the notion that you can achieve this even in a highly variable and occasionally oppositional context by giving operators the option to take managerial action about sufficiently turbulent systems and making their operational incentives “the long run reduction of shock and the acceleration of delivery” even at the cost of short term development velocity. Ostensibly, anyway. The third and generally current form is not concerned primarily with the computer, but takes many cues from human factors, military strategy, manufacturing, business management, and evolution in service of a repeatable path to delivering good software, on time, and under budget. While it is not the only route, the generally accepted route is to make systems observable, understand and engineer around their limits and scaling properties, and reduce the burden on engineers to deliver changes. When done skillfully, this is essentially accelerated and targeted evolution.
Arguably, “Politician” is the role best suited to approximate this for larger social structures. We’ve entered a period where, at least in business, but increasingly in all computer-touching fields, fuck-around has been permitted to outpace find-out by a tremendous margin, and this is accelerated both by the politics of computer-touchers and the political climate in which this acceleration is occurring. The Politician abstraction needs better observability and modelling tools to keep up with the acceleration, even if we assume the acceleration is based on hallucination and fraud. Maybe especially if we assume this. I think, personally, that models will likely be essential in speeding the evolutionary loop for the tool builders helping with that problem. I’m also not sure that politician is the right abstraction, but I’m not sure any of the current applicants for the role (CEO, Dictator, Soldier, et cetera) are preferable, and many seem explicitly worse.
In any case, that process is slowed by the shared hallucination among non-LLM-users about what LLMs can and cannot do, so I hope to help correct that.
AIL Disclosure: This page is AIL 1; I wrote it in its entirety, save the places I quoted Claude, and permitted a grammar, spelling, and clarity review. The repository, citations, and transcripts, as well as the software that formatted the transcripts, are all AIL 4.
-
¹ Frankly, this was my breaking point, where I needed to insert myself. I think User B was, prior to and even after this point, quite reasonable. My slightly younger son has written a video game (without assistance from AI or myself), but I would never make the unqualified claim that an untrained 12-year-old can make a game. Learning by osmosis exists. This was chosen specifically to be shitty, in a way that I am fully aware User B will likely never acknowledge. ↩
-
⁴ Nor would you with a human, a reminder of why this whole setup was a bit of an asshole move. ↩