Nate Meyvis

The dissertation benchmark

It's useful to have some personal, non-quantitative benchmarks to help you assess how AI is progressing. (Here is Simon Willison discussing his wonderful pelican benchmark.) Such benchmarks are useful for a few reasons:

  1. You can make different and useful kinds of quality judgments on a non-quantitative benchmark like this.
  2. It gives a vivid sense of AI's progress.
  3. It's fun.

My benchmark is uploading my dissertation and asking the model to think hard and give me feedback.1 Today's results:

ChatGPT 5.2: Gives strikingly good comments in less than a minute. They're a bit predictable and surface-level in places, but then again a dissertation should be pretty robust against predictable, surface-level criticism.

Claude Opus 4.5: More detailed comments, with better research into where the dissertation makes its strongest and weakest engagements with secondary literature. On some subjects ChatGPT was more perceptive and useful.

Neither of these can replace an actual committee, but each would supplement one extremely well. Pretty-good-to-very-good comments, delivered six orders of magnitude faster and very cheaply, are an invaluable resource.

By the way: a benchmark should ideally be easy to evaluate, and this one is quite hard to evaluate. That's a weakness of the dissertation benchmark. But it's genuinely illuminating and a clear reminder that AI is doing things I would recently have thought impossible.

  1. Today's prompt: "Please SUPERTHINK about this dissertation, do lots of research, and give me feedback the way a dissertation committee would." I try to give a few related prompts each time.

#academia #epistemology #generative AI #software