Category Archives: Bookalator

Revisiting book data

For Crafting with Data today, our assignment was to gather preliminary data for our Discovery Seeker project:

Use what you’ve learned to gather sufficient samples for your purpose. Explore again, check for significances, patterns, correlations or trends to share with the class.

Since my project is to try to approach the abandoned Bookalator from a new angle, I didn’t actually set out to collect new data right away. Continue reading Revisiting book data

The Project That Would Not Die

Vesalius spread

Hallowe’en is coming, which must be why the Bookalator has been stirring again, gathering its strength to once more stalk the earth.

Back in May, I made a summer “to do” list which was long enough to keep me busy for the next ten years. Among the proposed tasks was “finishing” the perpetually stalled Bookalator project. I then, of course, proceeded to spend the next three months mostly sitting around feeling bad about not doing stuff. But the constant demand for Ideas, Ideas, Ideas, and my constant lack of same, has made spring’s doleful mess look like fall’s Golden Ticket.
Continue reading The Project That Would Not Die

OMFG! I can’t believe this actually works!

Book grid sorted by word count

I made a Flex application!!!1111!!!!

Okay, so what is this thing?

Well, each green block represents the total word count of a book; that count is also given by the big number beneath each book. The orange stripe represents the number of unique words in the book, case insensitive (so The and the count as one). It’s all a continuation of this data visualization project I’ve been working on all semester.

And what’s it made of? Aside from snips, snails, and puppydog tails, it comprises some really stupid code, four static XML docs, some CSS, and one . . . class? I think it’s what passes for a class in Flex. God only knows. Here’s the filthy, embarrassing source code.

Yes, I am well aware that this is some of the most fucked-up, redundant, unnecessarily hard-coded shit you’ve ever seen, and that the radio buttons don’t work right on the first click, but considering that I only started learning Flex on, like, Thursday, and that I didn’t start trying to code this thing in earnest until the wee hours of Monday morning, I think it’s Oh. Kay.

And, yeah, no, I couldn’t figure out how to get rid of the gap between the orange and green blocks. CSS in Flex is really weird and undernourished.

Next steps:

  • Make this code not suck.
  • Add more data (I’ve got about 20 more books on hand to process, and then it’s time to hit Bittorrent).
  • Make the code for pulling out the word counts not suck. Right now it’s case-sensitive, which I don’t want it to be (I’ve been batch-converting the text to lowercase before processing it), and I’ve been manually deleting all words that start with numbers or that look likely to be roman numerals. I’ve also been doing this from Terminal, one file at a time, when it really ought to be able to process in batches. This proves that I am not nearly lazy enough, otherwise I would have dealt with this weeks ago, in order to spare myself a lot of tedious busywork.
  • Make the design not suck (e.g., get rid of those gaps, and replace the radio buttons with something less nasty).
  • Add more views—for example, something should actually happen when you mouse over or click on a book thumbnail, besides it lighting up in hideous powder blue. There are a lot more ways I want to slice up this data, and I still want to be able to compare books or sets of books. That will require building the word-counting code into the Flex app somehow.

But in the meantime, w000t! It works!

Comparalator

date merchant

As you may recall, for my midterm project, I got stumped on several seemingly simple tasks. One of those—the most important, since upon it depends my semester-long assignment for Mainstreaming Information—was figuring out a way to compare one list of words to another and pull out the words that were unique to one of those lists. In my head, I can see very easily how this would be done. Given my special way of haphazardly flailing through code, however, I just couldn’t get it to work.

Until today!

In fiddling with the Bayesian comparison code for this week’s homework, I finally pulled out a list of unique words. Of course, this is a completely perverse misuse of that code—like using a steamroller to kill a pillbug—but as long as it works, I don’t fucking care.

So, here’s what I did. In BayesClassifier.java, I replaced the last two for loops with the following:

for (String word: uniqueWords) 
    {
      for (BayesCategory bcat: categories) 
      {
        double wordProb = bcat.relevance(word, categories);
        if (wordProb < 1)
        {
        println(word);
        }
        else {}
      } // end for bcat
    } // end for word

    for (BayesCategory bcat: categories) 
    {
      double score = bcat.score(uniqueWords, categoryWordTotal);
      println("---The following words were not found in " + bcat.getName());
    } // end for bcat

And in BayesCategory.java I replaced the percentage and relevance blocks with

 public double percentage(String word) 
  {
    if (count.containsKey(word)) 
    {
      return count.get(word);
    } // end if
    else 
    {
      return 0.001;
    } // end else
  } // end percentage

  public double relevance(String word, ArrayList<BayesCategory> categories) 
  {
    double percentageSum = 0;
    for (BayesCategory bcat: categories) 
    {
      percentageSum += bcat.percentage(word);
    } // end for bcat
    return percentage(word);
  } // end relevance

So now, if I run the command

$ java BayesClassifier A2_unique.txt < B1_unique.txt | sort >results.txt

I get a list of words that are in B1_unique.txt (The Masada Scroll by Paul Block and Robert Vaughan, 2007) but not in A2_unique.txt (Zuleika Dobson or, An Oxford Love Story by Max Beerbohm, 1911). For example,

Akbar, Allah, Allahu, Apostolic, Ariminum, Arkadiane, Asmodeus, Astaroth, Barabbas, Beelzebub, Bellarmino, Blavatsky, Brandeis, Breviary, Byzantine, Caiaphas, Calpurnius, Catacombs, Charlemagne, Clambering, DNA, Diavolo, Franciscan, Freemasons, GPS, Gymnasium, Haddad, Hades, IDs, IRA, Jettisoning, Kathleen, Lefkovitz, MD, MRI, Masada, Masonic, Muhammad, Muhammadan, Nazarene, Nazareth, Olympics, Orthodoxy, Palatine, Palazzi, Palestine, Palestinian, Palestinians, Petrovna, Pleasant, Plenty, Plunge, Pocketing, Pontiff, Pontifical, Pontius, Praetorian, Prissy, Professors, Protestants, Rasulullaah, Ratsach, Revving, Rosicrucians, Satan, Scrolls, Seder, Shakespeare, Syracuse, Tacitus, Theosophical, Torah, Trastevere, Turkish, USB, Uzi, VAIO, VCR, Yeah, Yechida, Yeetgadal, Yiddish, adrenalin, agita, airliner, airport, ankh, awesome, bitch, bomb, bookstores, braked, breastplate, briefcase, broadsword, broiler, brotherhood, bulrushes, cellular, checkpoint, chuckling, chutzpah, combatant, computer, dashboard, database, departmental, desktop, divorce, dysentery, electricity, enabling, entrepreneurs, firearms, firestorm, fishtailed, flagon, forensics, goatskin, groggily, gunfire, gunman, gunshots, handbag, handball, handbrake, handgun, helicopter, helmets, highwaymen, hijinks, homeland, homeless, homespun, hometown, innkeeper, internship, journalist, kebob, kidnappers, kilometers, lab, laptop, lyre, mawkish, monitor, muezzin, nickname, nightfall, nonbeliever, northeaster, notebook, notepad, notepaper, numerology, paganism, password, pastries, phone, photo, photocopies, photocopy, photograph, photos, pig, pigeons, pistol, playback, police, quintessentially, recycles, redialed, roadblock, roadway, sandwich, screensaver, site, sites, submachine, superheating, synagogue, taped, taxi, terrorism, terrorist, terrorists, thousandfold, thrashing, toga, tortured, trigonometry, universe, unto, vegetables, vehicles, video, videotape, vinegar, violence, warehouses, waterfall, welfare, wholeheartedly, whoosh, whore, windshield, worker, workstation, worldwide, yardstick, yarmulkes, yeetkadash, zooming

And if I run the comparison in the opposite direction, I come up with words such as

Abernethy, Abiding, Abimelech, Abyssinian, Academically, Academy, Accidents, Achillem, Adam, Adieu, Admirably, Age, Agency, Agents, Alas, Albert, Alighting, America, Atlantic, Australia, Balliol, Baron, Baronet, Britannia, Broadway, Brobdingnagian, Colonials, Cossacks, Crimea, Devon, Dewlap, Duchess, Duke, Dukedom, Earl, Edwardian, Egyptians, Elizabethan, Englishmen, Englishwoman, Europe, Holbein, Ireland, Iscariot, Isis, Japanese, Kaiser, Liberals, London, Madrid, Meistersinger, Messrs, Monsieur, Napoleon, Novalis, Papist, Parnassus, President, Prince, Professor, Prussians, Romanoff, Segregate, Slavery, Socrates, Switzerland, Tzar, Victoria, Wagnerian, Waterloo, Whithersoever, Zeus, absinthes, acolyte, adventures, affrights, affront, afire, afoot, aforesaid, aggravated, album, analogy, anarchy, ankle, ape, aright, aristocracy, ataraxy, automatically, avalanche, avow, balustrade, bandboxes, bank, beastliest, beau, beauteous, billiards, biography, bodyguard, bosky, boyish, broadcast, bruited, bulldog, businesslike, bustle, calorific, casuistry, catkins, chaperons, chidden, cigarettes, clergyman, cloven, comet, compeers, coquetry, cricket, crinolines, custard, dandiacal, dapperest, decanter, devil, dialogue, diet, dipsomaniacal, disemboldened, disinfatuate, drunken, ebullitions, equipage, exigent, eyelashes, eyelids, farthingales, female, femininity, fishwife, fob, forefather, forerunners, freemasonry, furbelows, gallimaufry, goodlier, gooseberry, gorgeous, gypsy, haberdasher, halfpence, handicapped, handicraft, handiwork, handwriting, hearthrug, helpless, hip, hireling, honeymoon, housemaid, housework, hoyden, hussy, idiotic, impertinent, impudence, inasmuch, incognisant, insipid, insolence, insouciance, item, keyboard, landau, legerdemain, loathsome, luck, maid, maidens, manhood, manumission, matador, maunderers, model, mushroom, nasty, newspaper, noodle, nosegay, novel, oarsmen, omnisubjugant, ostler, otiose, parasol, pinafore, poetry, poltroonery, postprandially, prank, prestidigitators, propinquity, queer, romance, sackcloth, salad, sardonic, saucy, schoolmaster, seraglio, sex, skimpy, skirt, snuff, socialistic, streetsters, surcease, surcoat, swooned, teens, telegram, telegraphs, thistledown, thither, thou, threepenny, tomboyish, toys, tradesmen, treacle, ugly, uncouthly, unvexed, vassalage, waylay, welter, wigwam, witchery, withal, woe, woebegone, womanly, womenfolk, wonderfully, wonderingly, wretchedness, wrought, yacht, yesternight, zounds

Exciting!

A2Z midterm: Vocabu-lame

vocabulap, slide 7

Apparently, I have learned absolutely nothing all semester, because what seemed like a very straightforward project proved to be completely beyond my abilities.

The overarching goal is to generate data for the visualization I’m making for Lisa Strausfeld and Christian Marc Schmidt’s Mainstreaming Information class. The following are some slides explaining the gist of the project, provisionally called Vocabulap (vocabulary + overlap; not a handsome coinage):

My specific goals for the A2Z midterm were as follows (with subsequent comments in all caps):

For A2Z midterm
===============
Prep
—-
* Remove all blank lines
DONE
* Remove all extra spaces
DONE
* Break all lines – DONE
* Rename all to number consecutively: A01, A02, . . . A10 (for old books); B01, B02, . . . B10 (for new books)
DONE

Compare major sets
——————
* Extract the text from between the body tags in each file. Dump it out as a new file with the extension body.txt in the folder ../body.
THIS IS HARDER THAN IT LOOKS (FOR ME, AT LEAST). EASIER TO JUST CUT THEM OFF BY HAND.
* Concatenate all the files in each set.
DID THIS FROM THE COMMAND LINE, USING CAT
* Make a list of unique words in each concatenated set, with the number of times the word appears.
CAN GET THE UNIQUE WORDS, BUT NOT THE COUNT.
* Strip out all words beginning with numerals.
DONE BY HAND
* Create the following lists:
– Words shared by both major sets, with frequency counts
– Words unique to set 1, with frequency counts
– words unique to set 2, with frequency counts
I APPARENTLY CANNOT DO ANY OF THIS.

Find unique words in each book
——————————
For each book:
* Concatenate all the files in that major set *except* the file for that book.
* Make a list of the unique words, with frequency counts, in
– the current book
– the set of all books except the current one
* Make three lists:
– Words shared by all books in the major set, with frequency counts
– Words that appear only in the current book, with frequency counts
– Words that appear only outside the current book, with frequency counts

Return lines surrounding specific words
—————————————
For each word in a given list:
* Get the line numbers on which it appears.
For each appearance,
* Print the line above
* Print the line with the word, replacing it with itself wrapped in span tags to apply color
* Print the line below

The most essential piece of code that I could not get working is the comparison doodad. It almost worked for, like, five seconds, but it was generating a huge file of every unique word times however many words were in the document, or something like that. When I tried to fix it, it completely stopped working. The offending code is as follows:

/*  1. Takes in a file name from the command line.
    2. Makes a string array out of the hard-coded comparison file.
    3. Imports the contents of the file whose name was passed in.
    4. For each line of the input file (i.e., each word), changes it to 
       lowercase and checks to see if it's contained in the comparison file.
    5. If it's not in the comparison file, checks to see if it's in a hashset of 
       unique words.   
    6. If the word's not in the hashset, add it.
    7. Print the contents of the hashset.
*/

import java.util.ArrayList;
import java.util.HashSet;
import com.decontextualize.a2z.TextFilter;

public class CompareUnique extends TextFilter 
{
    public static void main(String[] args) 
    {
        new CompareUnique().run();
    } // end main

    private String filename = "body/unique/allB_uci.txt";
    private HashSet uniqueWords = new HashSet();
    private HashSet lowercaseWords = new HashSet();

    // make a String array out of the contents of the comparison file
    String[] checkAgainst = new TextFilter().collectLines(fromFile(filename));
    
  public void eachLine(String word) 
  {
    String wordLower = word.toLowerCase();
    for (int i = 0; i < checkAgainst.length; i++)
    {
        if (checkAgainst[i] != null && checkAgainst[i].contains(wordLower))
		{} // end if
		else if (checkAgainst != null)
		{
            if (lowercaseWords != null && lowercaseWords.contains(wordLower))
            { } // end if
            else if (lowercaseWords != null)
            {
                uniqueWords.add(wordLower);
                lowercaseWords.add(wordLower);
            } // end else
 		} // end else
    } // end for
  } // end eachLine

  public void end() 
  {
    for (String reallyunique: uniqueWords) {
      println(reallyunique);
    } // end for
  } // end end

} // end class

I know, it seems very simple, but you have no idea how long it took me to get this far.

So, basically, for the midterm I’ve got bupkis—just a big pile of text files, and a list of unique words for each.