On improving data-adjacent code

Here's the most stimulating bit of feedback from the post about Anki's database:

My impression is most databases are designed fairly haphazardly, and while tables may have fields added, the entity relationship diagram stays pretty static. My rationale is as follows: (1) A database schema requires quite a bit of thought and up-front design, which we're loathe to do in this agile world; (2) Schema migrations are hard and scary, so we avoid doing them.

--LAC-Tech on Hacker News (lightly edited)

So: we get into the habit of defining structures on the assumption we can change them easily, but when databases are involved we can't change them so easily. This is an underrated problem; it's both technical and cultural. Here's how I try to mitigate it:

  1. Think about persistence from the beginning, but delay actually persisting things until later. Use in-memory / mock versions of your persistence layer until you've had a chance to learn more about your objects.
  2. Have ready sources of test data, usually flat files but sometimes functions that make mock data. Maintain functions that transform that test data into your current object representations, so that you don't have to change the underlying test data every time you change your object representation. (It's usually a lot easier, both mechanically and psychologically, to change those functions than the underlying data.)
  3. Make the in-memory / mock versions of your persistence layer obey the same interface as your database-facing code will. This means that other parts of your code base should not need to know or care about persistence details. (The Domain-Driven Design folks are good at this.)
  4. Make utilities that test, for anything implementing the interface described in (3), whether the data conforms to the current schema. Dashboards are your friend: it's usually not hard to calculate and display lightweight analytics. Having a ready way to display a human-readable view of data samples is often invaluable.
  5. Add tests that feed the data from (2) through whatever CRUD functions you're building, retrieve that data through the interface described in (3), and test the data with the utilities in (4).

Different projects are different enough that these steps look different whenever you implement them. A common flow, however, is to have some utility function loop through test/data/test_widgets.csv, create a bunch of Widget objects, create a new widget persister and apply the current schema to it, pass the Widget objects into widget_persister.add_widget(), call widget_persister.get_all_widgets(), and run some tests on the result.

After I sketch out a design, I often build out parts of it roughly the way I'd implement a sand castle: by starting to put sand wherever I want there eventually to be castle. (It's not too hard to touch things up along the way, and the best way to know what to fix is to have built something that shows me what's wrong.) Persistence-facing parts work differently: it's the part of the sand castle where you have to leave the upside-down bucket in place of the tower for a while and just built around it.

This can be tricky, but it has clear benefits: it forces you to focus, for a while at least, on the interface the persistence layer will obey more than its implementation. Most of us don't think about interfaces enough, so this makes for great practice.


More posts about programming


Home page