I'm going deep on census data for the first time - the "Summary File 1" product in particular. It's basically the world's gnarliest pivot table.


There's also shockingly little high-quality, structured metadata documentation. Given how central the census is and how many folks use it, I was expecting a more mature, more modern set of tools. There are plenty of libraries that make specific tasks easier, but no good query layer?

My hypothesis - one I plan to test - is that this stems from the lack of a business model. Why go through the effort of improving things only to give them away? So this work has probably been done hundreds or thousands of times in private.

Is the census “big data”? It’s certainly unwieldy as distributed - it can feel like something that requires institutional resources to handle. But it’s only ~1TB, and tabular, and fully structured, which makes me think there’s got to be a Better Way.

