Nate Meyvis

How much data do you really have?

You're starting a project. You are responsible, so you design the new system with its data's scale in mind. You look in the relevant buckets for the current version of the data you're designing for, and they contain a total of 100 GB. Maybe that will grow over time, so you plan to design as if you need to handle hundreds of GB.

You might have vastly overestimated the size of your data:

  1. There are often huge, readily corrected inefficiencies in the storage of the data. Data that are intrinsically numeric, boolean, or datetime are often stored as VARCHAR columns or strings.
  2. A lot of real-world data is categorical but stored as strings. If 20% of your data is strings like FEDORA (2023 WINTER REVISION), when there are only hundreds of distinct hats, you can do much better at making your data both smaller and less brittle in storage.
  3. Data dumps often contain redundant data.
  4. There is often easily trimmed extra data that won't be needed in the future state of the system.
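
To make the first two points concrete, here is a rough sketch that compares the bytes needed to hold the same rows as text versus as typed, dictionary-encoded values. Everything in it is invented for illustration: the row layout, the column widths, and the second hat name are assumptions, not a description of any real system, and a real columnar format or database would make these choices for you.

```python
import struct
from datetime import datetime, timezone

# Hypothetical export in which every column is a string.
HATS = ["FEDORA (2023 WINTER REVISION)", "BOWLER (2022 SUMMER REVISION)"]
rows = [
    (str(1000 + i), "true" if i % 2 == 0 else "false",
     "2023-01-15T08:30:00", HATS[i % 2])
    for i in range(1000)
]

def string_size(rows):
    """Bytes needed to store every field as UTF-8 text (delimiters ignored)."""
    return sum(len(field.encode("utf-8")) for row in rows for field in row)

def typed_size(rows):
    """Bytes needed after typing each column and dictionary-encoding the hat."""
    categories = {}  # hat name -> small integer code
    packed = bytearray()
    for qty, flag, ts, hat in rows:
        code = categories.setdefault(hat, len(categories))
        # Assumption: timestamps are UTC; a real system must know its time zone.
        moment = datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)
        epoch = int(moment.timestamp())
        # "<i?qH": int32 quantity, 1-byte bool, int64 timestamp, uint16 hat code
        packed += struct.pack("<i?qH", int(qty), flag == "true", epoch, code)
    # Each distinct hat name is stored once, in the lookup table.
    dictionary_bytes = sum(len(name.encode("utf-8")) for name in categories)
    return len(packed) + dictionary_bytes

as_text = string_size(rows)
as_typed = typed_size(rows)
print(f"as text: {as_text} bytes, typed: {as_typed} bytes, "
      f"ratio: {as_text / as_typed:.1f}x")
```

The exact ratio depends entirely on the made-up data, so treat it as a way to check your own tables rather than as a benchmark.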

After a decade-plus of seeing all sorts of stored data, I'd conjecture that most legacy business data is 2x to 5x as large as it would be if:

  1. Cursory attention were paid to storage efficiency;
  2. Obviously irrelevant data were discarded;
  3. Reasonable attempts were made not to store the same data several times.
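
The second and third conditions can be sketched in a few lines. This is a minimal illustration under assumed inputs: the field names and sample records are hypothetical, and it only catches exact duplicates, not near-duplicates or data repeated across tables.

```python
def trim_and_dedupe(records, keep_fields):
    """Keep only the fields still needed, then drop exact duplicate records."""
    seen = set()
    result = []
    for record in records:
        trimmed = tuple(record[field] for field in keep_fields)
        if trimmed not in seen:
            seen.add(trimmed)
            result.append(dict(zip(keep_fields, trimmed)))
    return result

# Hypothetical records: "legacy_note" is a field the new system will not need.
records = [
    {"id": 1, "hat": "FEDORA", "legacy_note": "imported twice"},
    {"id": 1, "hat": "FEDORA", "legacy_note": "imported twice (again)"},
    {"id": 2, "hat": "BOWLER", "legacy_note": "ok"},
]
cleaned = trim_and_dedupe(records, ["id", "hat"])
print(len(records), "records in,", len(cleaned), "records out")
```

Note that dropping the unneeded field is what exposes the duplicate here: the two copies of record 1 differ only in a column the future system does not use.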

This matters because properly stored data is less likely to be surrounded by bugs, and cutting down data by 2x to 5x often unlocks very different system behaviors. It can make common query patterns near-instantaneous against up-to-date data; it can let you handle all your data in AWS Lambda (or analogous tools); it can unlock awesome front-end experiences. (It also lets you save money on storage, but that usually matters a lot less.)

If nothing else, please remember that "how big is this data set?" is a subtler question than it seems; that people systematically overestimate the size of their data; and that at ordinary scale, very modest optimizations might unlock much better system designs.