For Crafting with Data today, our assignment was to gather preliminary data for our Discovery Seeker project:
Use what you’ve learned to gather sufficient samples for your purpose. Explore again, check for significances, patterns, correlations or trends to share with the class.
Since my project is to try to approach the abandoned Bookalator from a new angle, I didn’t actually set out to collect new data right away. It’s extremely time-consuming, for one thing, and I really shouldn’t keep doing it by hand but should instead devote a few hours to automating the process, like any decently lazy programmer. Instead, I tried to look at the data as if I’d never seen it before—which is almost true, since I’d forgotten what it looked like and what the numbers measured, in the months since I last worked on it. Fortunately, in a rare fit of lucidity, I’d written a ReadMe back in April that explained where the data had come from and how to generate more of it.
So I took the main chunk of data (PDF, 41 KB)—I’d made a few versions, but this is the largest, most accurate set—into NeoOffice (MS Office having forsaken me after the Snow Leopard upgrade) and made the simplest, most straightforward visualization I could think of: scatter plots.
What I see looking at these is that the numbers in the two datasets are very similar. There aren't enough samples to generalize, though; I really need to automate that number-crunching process. I'd also like to see a scatter plot of the top and middle charts combined, with total words represented by the size of the dot and unique words controlling the position. I'd probably have to do that in Processing; I don't think a spreadsheet program can manage it. I also like the idea of box-and-whisker plots, which Rob talked about today, but I can't quite see the best way to use them with this data.
Next steps: automate the number crunching, get a bunch more books, crunch more numbers.
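Since "automate the number crunching" keeps coming up, here's a minimal sketch of what that script might look like: count total and unique words for each book in a folder of plain-text files. The folder name, the tokenizing rules, and the function names are all my own assumptions, not anything from the original project.

```python
# Hypothetical number-crunching sketch: total and unique word counts per book.
# Assumes books live as .txt files in a "books" directory; adjust as needed.
import re
from pathlib import Path

def word_counts(text):
    """Return (total_words, unique_words) for a chunk of text."""
    # Lowercase, then split on anything that isn't a letter or apostrophe.
    # This is a naive tokenizer -- good enough for rough comparisons.
    words = re.findall(r"[a-z']+", text.lower())
    return len(words), len(set(words))

def crunch(book_dir):
    """Count words for every .txt file in a directory of book texts."""
    results = {}
    for path in sorted(Path(book_dir).glob("*.txt")):
        total, unique = word_counts(path.read_text(errors="ignore"))
        results[path.stem] = (total, unique)
    return results

if __name__ == "__main__":
    # Tab-separated output, easy to paste into a spreadsheet for plotting.
    for name, (total, unique) in crunch("books").items():
        print(f"{name}\t{total}\t{unique}")
```

The tab-separated output drops straight into NeoOffice for the same scatter plots, and the (total, unique) pairs are exactly what the combined bubble chart would need.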