Anyone who has spent the last year playing with ChatGPT, Gemini, Claude, or a similar model knows the double act: a model that can answer exam questions and write boilerplate code at lightning speed, yet just as easily invents a policy for your own business. When the response looks polished, it is tempting to believe it. That is why a recent paper, AA‑Omniscience, fascinated me so much. The authors built a benchmark of 6,000 questions across 42 economically significant topics