Lies, damn lies, and statistics

Mainstreaming Information project proposal poster

Today we presented ideas for our semester-long projects in Mainstreaming Information. The assignment, which apparently I was not the only person to be confused by, is over at Christian’s site (PDF, 36 KB).

Last week we had to bring in some “jaw-dropping statistics” to start considering working with, and because I’ve decided that I’ll get more out of the rest of my time at ITP if I keep my schoolwork linked to (duh) stuff I’m actually interested in, I selected a couple of tidbits from Dan Poynter’s mass of book industry statistics:

1993–2003: The number of titles published increased 58% while fiction readers declined 14%.
—Malcolm Jones in Newsweek. Sources: NEA and RR Bowker.

2004. 56.6% of adult Americans said they read at least one book, fiction or non-fiction, between August 2001 and August 2002 compared to 60.9% ten years prior.

2002. 57% of the US population read a book. See report.

Most readers do not get past page 18 in a book they have purchased.

and from John Kremer’s Recent Statistics Related to Book Publishing and Marketing:

In a survey of 4,000 adults in the United Kingdom, 55% said “they buy books for decoration, and have no intention of actually reading them.” (Teletext) This is another important reason why your books should be well-designed. They should look good on a buyer’s coffee table, bookshelf, bedside stand, etc.

These served the purpose at hand, but they’re all just isolated data points. So over the weekend I spent several hours digging around for more information, but for none of these could I find enough reliable numbers to support a semester-long project. I was also looking for any compelling information about e-book sales versus print or audio books, and this morning I spent a while rummaging around on TeleRead. They had all sorts of statistics, none of which quite fit my needs, though they did give me a few more ideas about stuff I’d like to have statistics about. So around 11 a.m., with 3.5 hours left until class, I lazytweeted it, as a last resort. And I immediately got a bunch of responses from my nice, nice friends! Erin pointed me to the completely bitchen Book Scraper, from the London Times’s R&D labs, and reminded me that the New York Times has an API for its best-seller lists.

In the end, I decided I’d better scale down from the macro to the micro view, so that I could use data I might actually get: vocabulary statistics scraped (using my mad new Programming from A to Z skillz) from Project Gutenberg e-books, compared with those from recent Times best sellers. And then I went and found a Jaw-Dropping Statistic (which, not coincidentally, is bullshit; favorite line in the Straight Dope article: “At times it’s been attributed to Gallup polls or even entomologists.”) that went with the data I was planning to gather. Kind of bass-ackwards, but the result is the poster-style project proposal above, which was deemed Not Entirely Stupid during the classroom critique, despite its having been printed way too large, in fifteen 8.5 × 11-inch tiles, and glue-sticked-together in class using the second-worst glue stick in the universe (the worst being the one I had brought from home, which, it turned out, had dried up).
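For anyone wondering what “vocabulary statistics” might look like in practice, here is a minimal sketch in Python. It assumes each book is already available as a plain-text string; the function name `vocabulary_stats` and the particular measures (total words, distinct words, their ratio) are illustrative assumptions, not the project's actual code.

```python
import re
from collections import Counter

def vocabulary_stats(text):
    """Return simple vocabulary statistics for a plain-text book.

    Hypothetical helper for illustration only; the measures chosen
    here are assumptions, not the project's real method.
    """
    # Lowercase, then pull out runs of letters, allowing internal
    # straight apostrophes (so "don't" counts as one word).
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    counts = Counter(words)
    total = len(words)
    distinct = len(counts)
    return {
        "total_words": total,
        "distinct_words": distinct,
        # Distinct words divided by total words: a crude measure of
        # how varied the vocabulary is.
        "type_token_ratio": distinct / total if total else 0.0,
        "most_common": counts.most_common(5),
    }

sample = "It was the best of times, it was the worst of times."
stats = vocabulary_stats(sample)
print(stats["total_words"], stats["distinct_words"])
```

Two caveats: Project Gutenberg files carry license text at the top and bottom that would need to be stripped before counting, and the distinct-to-total ratio shrinks as texts get longer, so a fair comparison between books would probably use samples of equal length.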

Now, of course, I’m not even sure I can get files of contemporary best sellers to scrape, because of stupid !@%# DRM, so I’m kind of hoping that the Data Fairy will come to my aid. But my project is at least theoretically possible. Developing . . .

Bonus: Find the typo in the poster!

