- For Cantrill, debugging is the process of asking questions of a system (and not, primarily, attempting to fix a system). This reminds me of Michael Feathers' view of testing as primarily about describing what a program does (and not, primarily, of enforcing its correctness).
- Therefore, debuggability is the propensity of a system to have questions asked of it. I think about this in Socratic terms. When you're doing intellectual work, you of course need ideas to talk about. But you also need to be in the right condition to pose and answer questions in light of those ideas. That system of questioning is the more fundamental thing. (In grade school I was taught that The Scientific Method requires one to formulate hypotheses strictly before looking at data. This view seems totally archaic.)
- Relatedly: suppose X is causing an incident in a distributed system. Given that, you'd like to have an alert for X. It does not follow that you should have tons of alerts all over the place! Humans are not very good at inferring causes from alerts in the first place, and this ability degrades quickly with the number of alerts to process (video link). (Note that this is not quite how Cantrill puts the point.)
- We can learn about distributed systems incidents by looking to history, and not just the history of computer distributed systems. (Cantrill has a useful discussion of Three Mile Island.)
- Some people react to intellectual challenges by reciting everything they know; others are great interlocutors. Whom would you rather be working with in an emergency?