On Anki's database

I am a daily spaced repetition user (I describe how and why here and here). I've been working with my Anki data lately, which has meant exposing and manipulating the Sqlite database underlying the program. (I'm grateful to the Ankidroid project and r/anki for having documented so much about the .apkg format and the database itself.)

The database structure is fairly typical of data management in long-running projects. And, typically, some of its design looks rough to fresh eyes. When fresh eyes see rough parts of a software project, it's important to write about it promptly, and the Anki community is large and dedicated enough that it seems worth doing that writing in public. So here's what I think could be better.

Caveats

  1. Anki is, I think, maintained by a very small team. Some of the notes about the intelligibility of the schema matter less if there aren't many people who need to work with the database every day. You can think of the relevant notes as things that are merely suboptimal now and would be more damaging if the Anki project were to grow.
  2. The underlying database is, I think, not (or at least not primarily) intended for direct manipulation by program users. So there's a sense in which it doesn't matter very much how intelligible I think the data management is. That said, insofar as it affects the accuracy or performance of the software, it certainly matters. And given that people turn over parts of their minds to Anki, it's especially urgent to document these things.
  3. Anki is great and all this comes from a place of love.

Principles behind these notes

You can think of this as a data-oriented extension of my code review reference; this is all stuff I find myself saying a lot. Seeing so many of these issues crop up in Anki's database was one impetus for making a case study of it.

  1. Try to name columns (or tables, or documents, or whatever) according to what the thing is. Not its domain, not what you want it some day to be, and not even its intended use, but what it is.
  2. Unless you have a very good reason to denormalize your data, represent a given piece of information at most once in your database.
  3. Unless you really need to optimize something, make a given column represent only one piece of information. (You know those surveys where you rank things on a 1-7 scale, but "4" also means "not applicable"? Do better than that.)
  4. All else equal (and sometimes even if not), use consistent conventions (e.g., for naming, for delimiters, and for UUIDs).

Notes

  1. In the cards and notes tables, the id (primary key) field is the number of milliseconds since the epoch when it was created. There are good reasons not to have semantically meaningful information in the id field: in this case, it is entirely possible for several cards to be created at the same millisecond. (The foreseen use case is that of a single human creating cards manually, but what if the system scales to many users with many cards? What if cron jobs that go off at midnight generate new cards? What if cards are created from timestamps that only have second-, hour-, or day-level resolution?)
  2. In the cards table, the type and queue columns duplicate information and both express what I'd be tempted to call a "learning status" (whether the card is new, is being reviewed, is buried, and so on). The database would be cleaner with these collapsed into a single column with a better name.
  3. In the cards table, several columns have integers representing times or durations for which the units are different depending on some other feature of the card. For example, the due column is an integer representing either a number of days or a number of seconds, depending on the card status. I could imagine a case for splitting this out into separate columns. It's also worth considering not explicitly storing the data at all, because it's a function of other data (and the learning algorithm). Because most decks are not huge, it could probably could be calculated on demand (or, if it really needs to be precomputed, cached elsewhere).
  4. The cards table has a reps column indicating its number of repetitions. This duplicates information from the revlog table. One could imagine a situation where information needs to be denormalized in this way, but (i) this system is clearly not designed for the sort of scale that would require that (see other notes) and (ii) if so, it would probably be better to keep "source of truth" cards data in a separate table and generate separate denormalized views or tables. Similar comments apply to the lapses column.
  5. The flags field is only significant mod 8: mod-8 values correspond to various colors of flags or no such flag. This rules out the possibility either of a card's having several flags set or of there being more than 8 flag-states. Moreover, coders accustomed to "flags" fields will likely anticipate a system where bits of a binary representation correspond to flags' being set. Given that this is an integer field in either case, little would be lost and much gained by adopting a more standard system for representing flag values.
  6. There is a graves table representing cards, notes, and decks that have been deleted. This is awkward. First, it combines heterogeneous data (data that are different not only in representing different categories of items but also in actually being stored in separate tables in this database when learning circumstances are different). Second, it amounts to moving data from one table to another in the case of a status change; it is usually better to simply make such a change like that in existing status fields. (I often find that solutions like this are attempted compromises by programmers who are pretty sure, but not completely sure, that data should be deleted. It is almost always better to settle that question once and for all--by literally deleting it, by leaving it, or by sending it off to some archive or backup--than to attempt this kind of compromise.)
  7. The relationship between notes and tags is represented by a text column on notes with a list of tag names, both separated by spaces and with spaces at the end (in order to support LIKE "%tag%" queries, according to the Ankidroid documentation linked above). This is an error-prone way to represent a many-to-many relationship. I'm not an SQLite expert, but I'd imagine it's inefficient, too. More fundamentally, tags are clearly domain objects that should be properly represented (and probably need their own table). Whenever I've tried to keep track of objects like this ad hoc as strings--and I've tried this many times!--I've regretted it. (Do as I say, not as my past self has done.)
  8. flds (representing a many-to-many relationship between notes and fields) is represented similarly to tags, but separated by 0x1f. The comments above apply, and there's an extra bit of potential confusion because the delimiter is different here. An aside: in this case, using flds instead of fields appears to be abbreviation for abbreviation's sake, and nothing more. But I often find that abbreviations like this, especially when nearby objects are not abbreviated, correspond to misalignments between the abbreviated word and the data object. So, for example, tmstp might abbreviate "timestamp" but actually mean something closer to "a timestamp but not really." When you read code, get in the habit of asking whether abbreviated names are subconscious (or conscious) expressions of the fact that the name isn't quite right.
  9. The csum ("checksum") field on notes is (apparently) intended for uniqueness checks. (It's not clear that this is literally a checksum; see point (8).) This function would be better served by using built-in mechanisms for creating unique ids. (More generally, database software is pretty good at primary key management and generation. Programming is hard enough; leave this sort of thing to the software.)
  10. In the revlog table, the time field represents a duration, is capped at 60s, and is represented in milliseconds. It's usually best not to throw away information. (Are you so sure you won't ever care about the difference between 65 and 75 seconds?) Moreover, this is not a time but a duration and should be so named. And it would probably be best simply to store the beginning and end timestamps of the review and let the calculations be done downstream: this is a subtle form of storing several pieces of information together.

Home page