The Difficulty of Cataloguing LLM Errors
I sat down today to write a different article. For a while, I’ve been wanting to catalogue some of the frequent and dangerous errors made by the LLMs I use in my day-to-day work, so today I did a couple of hours’ work and took some transcripts. My plan was to set up the context, and then show how the LLM is capable of misleading those with a passable, but not yet complete, command of the domain.
However, when I found myself trying to explain the context around these errors, I quickly got hit with a complexity cascade. The work I am doing can be intricate, and has unique constraints that seem ridiculous when looked at from the outside. I’m sure many of you feel the same way about some of your work. Heck, some of the constraints are ridiculous, but corporations are ridiculous things with ridiculous demands. Delivering value anyway remains the job.
Perhaps one day I will be able to write that article, but communicating this context is, for the moment, beyond my capability as a writer.
It’s not that I can’t show you the logs. Here are some admissions Opus 4.8 made just in the last couple of hours. These are all taken from a Claude Code harness set at high reasoning, with full access to the codebase in question.
- You're right, and I conflated two genuinely different ownership contracts.
- Confirmed — you're right, and it's worse than "needs a using."
- Both passed — which means my "un-fixable leak" prediction was wrong again. disposeArray and .delete() do free the transferred pointer objects (AliveCount returns to baseline). So the reference() handles here are evidently owning enough that .delete() runs the destructor, contrary to what I inferred.
- I've now mispredicted twice, so let me stop theorizing and actually probe the ground truth rather than assert it.
- So the real defect is narrower and the fix is simpler than my earlier alarmism: container args (pointer and value) need to route through the existing JSDisposable wrapper so [Symbol.dispose] is attached.
- My third test was wrong under this model — it asserted disposal returns AliveCount to baseline, i.e. that disposing frees the elements. It shouldn't. Disposing the container must leave element AliveCount unchanged.
- You're right — I inverted it. delete[] on the backing array invokes the destructor of each element, so disposing the container cascades and frees the elements.
- Yes — that's better than my GC approach. Holding references to the element proxies and disposing them explicitly makes it deterministic: no GC-pressure flakiness, and it provokes the double-free directly rather than hoping a finalizer runs. That's the right call.
No, I’m not showing you the initial prompts and full transcript. I was going to in the article I wanted to write, but for reasons I am about to get into, it’s not quite that simple.
These are vaguely in order, albeit cut out of dozens of paragraphs of cruft that Claude insists on generating despite being explicitly told to avoid verbosity. Maybe you can even tell what I’m working on? It’s TypeScript bindings to a pre-existing C++ library.
I find looking back on these quite amusing, because for the most part, I suggest the correct solution to the LLM in the initial prompt, then it tries to correct me, before “realizing” that I’m right.
These are all admissions that were given after I corrected it over the course of a couple of hours of programming. They are also only the ones I deemed misleading enough that the initial suggestion might have fooled someone less familiar with the space into accepting a lesser, and more often than not subtly broken, solution. I know this to be true because whilst I am quite familiar with the space, I’m certainly not beyond being fooled myself. I trust we all realize by now that it is much harder to verify the correctness of something presented to you than of something you produced yourself.
In the end, I shut the thing down, realizing that I couldn’t lean on my addiction to shortcut my way through this work and would have to draw out a consistent theory the old-fashioned way. Thank god I learned what correctness and completeness look and feel like before the LLM bubble, otherwise I would not have even been able to make that determination.
You can see that I’ve been thinking about memory ownership. In particular, I’ve been binding callbacks today and thinking about the ownership of the arguments provided back to the user via the callback parameter list, especially container arguments which I am finding particularly troublesome. Callbacks drive async mechanisms, and will be perhaps the most important mechanism in the bindings layer.
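To make the ownership question concrete, here is a toy model in plain TypeScript. It is only a sketch under stated assumptions and is not the bindings discussed here: `NativeHeap`, `ElementProxy`, `OwningContainer` and `demoOwnership` are all hypothetical names, the "heap" is a simulated set of pointer ids rather than real C++ memory, and an explicit `dispose()` method stands in for whatever disposal hook a real binding would attach. It assumes the model in which disposing a container cascades and frees its elements, so any element handle that also believes it owns its pointer will double-free.

```typescript
// Simulated native heap: tracks which "pointers" are still alive.
// Purely illustrative; no real native memory is involved.
class NativeHeap {
  private live = new Set<number>();
  private next = 1;

  alloc(): number {
    const ptr = this.next++;
    this.live.add(ptr);
    return ptr;
  }

  free(ptr: number): void {
    // Set.delete returns false when the pointer was already freed.
    if (!this.live.delete(ptr)) {
      throw new Error(`double free of pointer ${ptr}`);
    }
  }

  get aliveCount(): number {
    return this.live.size;
  }
}

// A JS-side handle to one element. `owning` decides whether dispose() frees it.
class ElementProxy {
  private heap: NativeHeap;
  private owning: boolean;
  private disposed = false;
  readonly ptr: number;

  constructor(heap: NativeHeap, ptr: number, owning: boolean) {
    this.heap = heap;
    this.ptr = ptr;
    this.owning = owning;
  }

  dispose(): void {
    if (this.disposed) return;
    this.disposed = true;
    if (this.owning) this.heap.free(this.ptr);
  }
}

// A container that owns its elements: disposing it cascades to every element.
class OwningContainer {
  private heap: NativeHeap;
  private ptrs: number[];
  private disposed = false;

  constructor(heap: NativeHeap, ptrs: number[]) {
    this.heap = heap;
    this.ptrs = ptrs;
  }

  // Hands out a borrowed (non-owning) view of one element.
  at(index: number): ElementProxy {
    const ptr = this.ptrs[index];
    if (ptr === undefined) throw new RangeError(`index ${index} out of range`);
    return new ElementProxy(this.heap, ptr, false);
  }

  dispose(): void {
    if (this.disposed) return;
    this.disposed = true;
    for (const ptr of this.ptrs) this.heap.free(ptr);
  }
}

// Holds element handles explicitly and disposes them in a fixed order, so the
// double free is provoked deterministically instead of waiting on a finalizer.
function demoOwnership(): { afterAlloc: number; afterDispose: number; doubleFreeCaught: boolean } {
  const heap = new NativeHeap();
  const container = new OwningContainer(heap, [heap.alloc(), heap.alloc(), heap.alloc()]);
  const afterAlloc = heap.aliveCount;

  const borrowed = container.at(0);
  // The mistake under test: a second handle that wrongly claims ownership.
  const wronglyOwning = new ElementProxy(heap, borrowed.ptr, true);

  container.dispose(); // cascades: frees all three elements
  const afterDispose = heap.aliveCount;

  borrowed.dispose(); // safe: a borrowed handle never frees

  let doubleFreeCaught = false;
  try {
    wronglyOwning.dispose();
  } catch {
    doubleFreeCaught = true;
  }
  return { afterAlloc, afterDispose, doubleFreeCaught };
}
```

In this toy the double free is a thrown `Error`. In a real native binding it would more likely be silent heap corruption, which is why a test that merely runs to completion says little about whether the ownership theory is right.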
Binding an RAII language to a garbage collected language, even in a single-threaded, synchronous way, can be tricky. However, the library I work on can be a non-ideal surface for binding, with craggy and non-uniform ownership expectations atop a highly asynchronous, object-oriented interface with many potential error paths. Is this ideal? Not at all. Does it make sense why it has to remain this way for the moment? Yes, with all the context I have, it does.
If I were to attempt to write the article I set out to write, I would feel a need to justify all this, for fear of reprisal. It’d be the classic Stack Overflow-style response: “Why are you even doing it this way? Do it this other way instead.”
I can’t justify any of this in a blog post, not without getting into a historical and political accounting of the company I work for and the teams I am on, which would simply take far too long.
Furthermore, the ephemeralities are beyond my capability to write down and convey. Things like how I am considering the proclivities of my eventual users, the desire for natural language semantics, the contradictory and sometimes confused semantics of the underlying library, the cost and complexity of maintaining the binding layer itself, the desire for simple rather than complex memory ownership theories, the need for a near drop-in replacement, the fact that this is only one of many similar language binding surfaces atop the same library, how the QA and verification process in my company works, the need to deliver this project by a set date, and so on. The list goes on forever. How can I share how dangerous some of these confidently asserted outputs are, given that I’d have to write a goddamn book to communicate enough of the subtleties in a textual format?
It is particularly unfortunate that most of the solutions I was given would have appeared to work for some time, but would have been formally unsafe, causing periodic crashes that the downstream engineer would have found impossible to reliably reproduce and report. This presents an additional wrinkle in communicating LLM errors, because you need to be able to frame and convey these subtleties. In my domain, it is rare to be able to assert that any given system “works” by demonstration; the truth of the matter is always more complex than that.
Over time, that sort of thing can kill a technical culture, as broken-windows thinking sets in, downstream users become jaded about a perceived but not provable lack of stability, and simply stop caring enough to reproduce and report errors. Bad technical cultures will even start to imply that crashes experienced by downstream users (in managed languages, I might add!) are actually user error, something I rarely find to be the case when looked at from a systems perspective.
I, like everyone else, am deeply concerned for less experienced engineers attempting to learn a domain in this climate. I have seen firsthand how dangerous this facade of correctness can be. Heck, I’m suffering from it myself. Given that I tend to be the most knowledgeable person at my company in this area, LLMs default to being my talking companions, as they provide the facsimile of experience that my as-yet unlearned colleagues do not. This is frustratingly self-fulfilling, as our innate desire not to engage with others, especially when we all work remotely, leads us to keep taking this easy route. We are subtly disincentivized from upskilling others, and from learning from them as we would have in the past, which locks these unsatisfactory LLMs in as the only capable companion an expert can ever expect to have.