Nate Meyvis

Kirkby and Matuschak on making flashcards with LLMs

Here are Ozzie Kirkby and Andy Matuschak reporting on their attempts to get LLMs to make and evaluate good flashcards based on their reading notes. It's both well-written[1] and relevant. They work from good ideas about how flashcards ought to be written, or at least ideas close to mine.

Some notes:[2]

  1. This is what it looks like when the authors actually care about the thing they're studying: diversity of approach, creativity, and tenacity. I evaluate this whole genre of essay on a spectrum from "really care about figuring this out" to "trying to get an A from some real or imagined teaching assistant." This is very much on the good side.
  2. The project of fully automating the highlight-to-card process via LLM is, to me, undermotivated. The larger pipeline here includes reading, highlighting, thinking, card composition, and intermittent study. Given that, I'm not so worried that, e.g., "even the strongest model we tested (GPT-5.2) still produces unusable prompts roughly a third of the time." A strong flashcarder should be able to recognize and cull those very quickly, especially relative to the overall time commitment of studying something.[3] Here as elsewhere, I'm less concerned than others about whether an AI can do 100% of something, and more concerned about whether it can do 25%, 50%, 75%, and 95% of it.
  3. I'm glad that they tried fine-tuning, and found their various efforts here useful. Again, I draw a more optimistic conclusion than their "we got cheaper judges, not better ones," or perhaps the same conclusion in a more optimistic tone. Cheaper judges are good! I'm particularly interested in whether several cheaper judges could be aggregated, either now or in a near future of more sharply distinguished models.
  4. I'm grateful for their work in evaluating all those models so carefully, but am still in the "benchmarks have never been less useful" camp.
  5. I strongly agree that the training data are mostly bad and that most flashcarders' processes are not optimized for the sort of learning that interests Kirkby and Matuschak here. I disagree in places, however, with their views on how highlighted material should be captured in a flashcard. So, for example, I don't think it's so bad simply to memorize the traditional three factors of production.[4] I'd also go about studying their "humans flying by flapping wings on Titan" example differently, but the details here would require another post.
  6. As an experiment, I asked Claude to make me 60 flashcard candidates from my Norton-anthology highlights: I'm still studying 19th-century British literature and using it as a way into the politics and history of the period. Most of the candidates were bad, but many were usable or editable. Claude and I working together were a lot more efficient than I was working alone. This is in part because Claude had access to a local SQLite database of all my questions and responses and could query it to learn how I write this kind of card, where my library's gaps are, and so on.[5] Again, culling and editing wasn't the time-consuming part (and note that this culling and editing is both educational and, for me, pleasant!).[6]
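To make the database-backed workflow in note 6 concrete, here is a minimal sketch. Everything in it is hypothetical: the post doesn't describe a schema, so the table name `cards`, its columns, and the sample rows are assumptions for illustration only.

```python
import sqlite3

# Hypothetical schema -- the post doesn't specify one. Assume a single
# table cards(question, response, tag, created_at); an in-memory database
# stands in here for the local file described in the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cards (question TEXT, response TEXT, tag TEXT, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO cards VALUES (?, ?, ?, ?)",
    [
        ("Who wrote 'Dover Beach'?", "Matthew Arnold", "British lit", "2024-01-02"),
        ("Year of the First Reform Act?", "1832", "British history", "2024-02-10"),
    ],
)

# The kinds of queries an assistant might run to learn house style
# (typical question length) and library coverage (cards per tag):
avg_len = conn.execute("SELECT AVG(LENGTH(question)) FROM cards").fetchone()[0]
by_tag = conn.execute("SELECT tag, COUNT(*) FROM cards GROUP BY tag").fetchall()
```

The point isn't these particular queries but the access pattern: a model that can inspect your existing cards can imitate their style and target gaps, rather than generating from the highlight alone.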

So: this is valuable work (unless I'm forgetting something, the best thing I've read about spaced repetition this year), but I'd encourage a different picture of human-AI cooperation in spaced repetition. Relatedly, I'm more optimistic than the authors about LLMs' helping us make good flashcards.


  1. ...and, Pangram and I agree, at least mostly written by humans.

  2. The usual disclaimer: I make Zippyflash and have used it for years. I'm invested in Zippyflash on many dimensions: primarily as a user, but also financially and ideologically.

  3. This is why Zippyflash distinguishes fundamentally between cards and card candidates, and the API makes it easy for LLMs to submit candidates for your review. For a bit more about LLM-centric API design, see here.

  4. Whether these should be memorized as one card or many depends, I think, on (e.g.) how likely you are to remember some but not all of them. I'm not saying that "What are the traditional three factors of economic production?" is the right flashcard, but just that memorizing this "textbook list" is not a bad idea even if your goal is a much different kind of understanding.

  5. The details here are less important than the idea that, again, good flashcards are less likely to come from a simple highlight-to-card LLM call and more likely to come from a more complicated human-AI partnership. (For now, at least.)

  6. It's not so hard to get your Kindle highlights into a format your LLM can use, but it's a clunky process.
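  For the curious, a rough sketch of that clunky conversion, assuming the standard `My Clippings.txt` layout (entries separated by a line of ten equals signs; title line, then a metadata line, then the highlight text):

```python
SEPARATOR = "=" * 10  # Kindle separates entries with a line of ten '=' signs

def parse_clippings(text):
    entries = []
    for chunk in text.split(SEPARATOR):
        # Strip a possible BOM and surrounding whitespace from each line.
        lines = [line.strip("\ufeff").strip() for line in chunk.strip().splitlines()]
        if len(lines) < 3:
            continue  # empty or malformed chunk
        entries.append({
            "title": lines[0],  # e.g. "Title (Author)"
            "meta": lines[1],   # page / location / date line
            "text": "\n".join(lines[2:]).strip(),
        })
    return entries

sample = """The Norton Anthology of English Literature (Greenblatt, Stephen)
- Your Highlight on page 12 | Location 180-183 | Added on Monday, 1 January 2024

The sea is calm to-night.
==========
"""
cards = parse_clippings(sample)
```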

#generative AI #reading notes #software #spaced repetition