A tradeoff in defining database schemas
Here are some decisions of a sort that comes up frequently:
- You have a
hatstable in your database withuuid,hat_type(e.g.,fedoraortrilby),size, anddress_code_levelcolumns. Now you need to describe fascinators in your database also. They also have sizes and dress code levels, but alsoattachment_type,attachment_position, andhas_veil. Do you (i) create a separatefascinatorstable or (ii) add some columns to yourhatstable, rename it tomillinery, and accept that fields likeattachment_typecan be null in some cases but not others? - You need to store user settings. Right now the only settings are a preferred time zone and a preferred theme (dark or light mode), but some day you'll probably need more. Do you (i) have columns for
preferred_time_zoneandpreferred_theme, and accept that you'll need to change the schema later to support more settings, or (ii) put settings in asettingsJSON1 column? - You are collecting data about hat interactions. Sometimes people tip their hats, sometimes they put them on, sometimes they take them off, and so on. These events have some, but only some, overlapping fields. Do you (i) try to cluster like events in tables like
hat_tipsandhat_on_events, so that events in those tables have near-identical or identical attributes, or (ii) put all of them in aneventstable?
Call an approach that favors (i)-type answers "table faithfulness"2 and an approach that favors (ii)-type answers "table minimalism." The first approach tends to respect the semantics of table names and columns better, and to give values (including nulls) more consistent meaning between rows. The second approach tends to make it easier to know where data is and to process data without lots of joins and table lookups.
Any specific decision will, of course, depend on your specific domain and tooling, but I tend to value table faithfulness and (i)-type solutions more than other people do:
- When the meaning of a column is consistent for rows in a database, you can more easily do integrity checks (and especially programmatic integrity checks). For example, you can raise an alarm if two columns are null at the same time.3
- Table and column names are forms of documentation, and extremely commonly referenced ones at that. The more semantically accurate they are, the better.
- Many database features (indexes, range queries, grouping, and so on) assume, or at least work better when, you've prioritized table faithfulness.
- Table faithfulness discourages practices like stuffing extra fields into
settingscolumns haphazardly and keeping business logic in developers' heads. - Modern tooling makes it easier to do migrations and learn what tables and columns4 exist, even when there are many of them. So, the drawbacks of faithfulness are getting less and less severe.
This is almost always a tradeoff, and sometimes an intermediate or (ii)-type solution is best. Over the years, though, I've very often found myself advocating for more faithfulness, all things considered, and I've rarely regretted that. (Even more often, I've had to deal with the consequences of a bloated "here's where all the main data lives, and please message this one person if you have any questions" table.) So, if you're facing such a decision5 and not sure what to do, please consider this a vote for faithfulness.
P.S.: Ideally, your persistence layer is encapsulated sufficiently well that most of your code never knows or cares which approach you take, but that's another post.
...or JSONB, or something else. This and many other database-specific details can affect your decisionmaking, but not the basic shape of the problem I'm trying to describe.↩
I don't love this name and would be grateful to hear a better one.↩
There are also any number of database-specific mechanisms for this sort of check.↩
Here as elsewhere, I'm using "columns" loosely to include, for example, fields in a schemaless database.↩
Very often, I've found that what is justified retrospectively as some version of table minimalism ("we wanted to keep things simple!" or "we needed to move fast and didn't want to do a migration") was, upon inspection, backed by no explicit decision-making process at all beyond the pull requests implementing it.↩