Category Archives: Mainstreaming Information

OMFG! I can’t believe this actually works!

Book grid sorted by word count

I made a Flex application!!!1111!!!!

Okay, so what is this thing?

Well, each green block represents the total word count of a book; that count is also given by the big number beneath each book. The orange stripe represents the number of unique words in the book, case insensitive (so The and the count as one). It’s all a continuation of this data visualization project I’ve been working on all semester.
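For the curious, the two numbers per book can be sketched in a few lines of Java. This is a minimal sketch, not the code behind the Flex app: the class name `WordCounts`, its method names, and the rule that a word is a run of letters (optionally joined by apostrophes) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names and tokenizing rule are assumptions.
class WordCounts {
    // Assumption: a word is a run of letters, optionally joined by apostrophes ("don't").
    private static final Pattern WORD = Pattern.compile("\\p{L}+(?:'\\p{L}+)*");

    // Pulls every word out of the text, in order, keeping its original case.
    static List<String> words(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Total word count: the green block.
    static int totalWords(String text) {
        return words(text).size();
    }

    // Unique words, ignoring case: the orange stripe.
    static int uniqueWords(String text) {
        Set<String> seen = new HashSet<>();
        for (String w : words(text)) {
            seen.add(w.toLowerCase(Locale.ROOT));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        String sample = "The cat saw the other cat.";
        // prints "6 total, 4 unique"
        System.out.println(totalWords(sample) + " total, " + uniqueWords(sample) + " unique");
    }
}
```

Lowercasing before adding to the set is what makes The and the collapse into one entry.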

And what’s it made of? Aside from snips, snails, and puppydog tails, it comprises some really stupid code, four static XML docs, some CSS, and one . . . class? I think it’s what passes for a class in Flex. God only knows. Here’s the filthy, embarrassing source code.

Yes, I am well aware that this is some of the most fucked-up, redundant, unnecessarily hard-coded shit you’ve ever seen, and that the radio buttons don’t work right on the first click, but considering that I only started learning Flex on, like, Thursday, and that I didn’t start trying to code this thing in earnest until the wee hours of Monday morning, I think it’s Oh. Kay.

And, yeah, no, I couldn’t figure out how to get rid of the gap between the orange and green blocks. CSS in Flex is really weird and undernourished.

Next steps:

  • Make this code not suck.
  • Add more data (I’ve got about 20 more books on hand to process, and then it’s time to hit BitTorrent).
  • Make the code for pulling out the word counts not suck. Right now it’s case-sensitive, which I don’t want it to be (I’ve been batch-converting the text to lowercase before processing it), and I’ve been manually deleting all words that start with numbers or that look likely to be roman numerals. I’ve also been doing this from Terminal, one file at a time, when it really ought to be able to process in batches. This proves that I am not nearly lazy enough; otherwise I would have dealt with this weeks ago and spared myself a lot of tedious busywork.
  • Make the design not suck (e.g., get rid of those gaps, and replace the radio buttons with something less nasty).
  • Add more views—for example, something should actually happen when you mouse over or click on a book thumbnail, besides it lighting up in hideous powder blue. There are a lot more ways I want to slice up this data, and I still want to be able to compare books or sets of books. That will require building the word-counting code into the Flex app somehow.
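The word-count cleanup in the list above (lowercasing, dropping words that start with numerals or look like roman numerals, and handling several files per run) could look roughly like this. It is only a sketch under assumptions: `WordCleaner`, `clean`, and `cleanFile` are made-up names, the input is assumed to be one word per line in UTF-8, and the roman-numeral test is a strict pattern that will also throw out real words such as "i" and "mix".

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Hypothetical cleanup step; names are illustrative, not from the original scripts.
class WordCleaner {
    // Strict roman-numeral shape (1 to 3999), matched against lowercased words.
    private static final Pattern ROMAN = Pattern.compile(
        "m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})");

    // Returns the lowercased word, or null if the word should be dropped.
    static String clean(String word) {
        String w = word.trim().toLowerCase(Locale.ROOT);
        if (w.isEmpty()) {
            return null;
        }
        if (Character.isDigit(w.charAt(0))) {
            return null; // "1911", "2nd"
        }
        if (ROMAN.matcher(w).matches()) {
            return null; // "xiv", but also false positives like "i" and "mix"
        }
        return w;
    }

    // Assumption: the file holds one word per line, encoded as UTF-8.
    static Set<String> cleanFile(String path) throws IOException {
        Set<String> kept = new TreeSet<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String w = clean(line);
            if (w != null) {
                kept.add(w);
            }
        }
        return kept;
    }

    // Batch mode: every file named on the command line is handled in one run.
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + ": " + cleanFile(path).size() + " unique words kept");
        }
    }
}
```

Because of the false positives, a short exception list of real words would probably be needed before trusting the roman-numeral filter on a whole book.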

But in the meantime, w000t! It works!


date merchant

As you may recall, for my midterm project, I got stumped on several seemingly simple tasks. One of those—the most important, since upon it depends my semester-long assignment for Mainstreaming Information—was figuring out a way to compare one list of words to another and pull out the words that were unique to one of those lists. In my head, I can see very easily how this would be done. Given my special way of haphazardly flailing through code, however, I just couldn’t get it to work.

Until today!

In fiddling with the Bayesian comparison code for this week’s homework, I finally pulled out a list of unique words. Of course, this is a completely perverse misuse of that code—like using a steamroller to kill a pillbug—but as long as it works, I don’t fucking care.

So, here’s what I did. In the BayesClassifier class, I replaced the last two for loops with the following:

[java]
for (String word: uniqueWords) {
  for (BayesCategory bcat: categories) {
    double wordProb = bcat.relevance(word, categories);
    // a word missing from this category comes back as 0.001, so print it
    if (wordProb < 1) {
      println(word);
    }
  } // end for bcat
} // end for word
for (BayesCategory bcat: categories) {
  double score = bcat.score(uniqueWords, categoryWordTotal);
  println("---The following words were not found in " + bcat.getName());
} // end for bcat
[/java]

And in the BayesCategory class, I replaced the percentage and relevance blocks with

[java]
// returns the raw count for a known word, or 0.001 for a word never seen
public double percentage(String word) {
  if (count.containsKey(word)) {
    return count.get(word);
  } // end if
  else {
    return 0.001;
  } // end else
} // end percentage

public double relevance(String word, ArrayList<BayesCategory> categories) {
  double percentageSum = 0; // still computed, no longer used
  for (BayesCategory bcat: categories) {
    percentageSum += bcat.percentage(word);
  } // end for bcat
  return percentage(word);
} // end relevance
[/java]

So now, if I run the command

$ java BayesClassifier A2_unique.txt < B1_unique.txt | sort >results.txt

I get a list of words that are in B1_unique.txt (The Masada Scroll by Paul Block and Robert Vaughan, 2007) but not in A2_unique.txt (Zuleika Dobson or, An Oxford Love Story by Max Beerbohm, 1911). For example,

Akbar, Allah, Allahu, Apostolic, Ariminum, Arkadiane, Asmodeus, Astaroth, Barabbas, Beelzebub, Bellarmino, Blavatsky, Brandeis, Breviary, Byzantine, Caiaphas, Calpurnius, Catacombs, Charlemagne, Clambering, DNA, Diavolo, Franciscan, Freemasons, GPS, Gymnasium, Haddad, Hades, IDs, IRA, Jettisoning, Kathleen, Lefkovitz, MD, MRI, Masada, Masonic, Muhammad, Muhammadan, Nazarene, Nazareth, Olympics, Orthodoxy, Palatine, Palazzi, Palestine, Palestinian, Palestinians, Petrovna, Pleasant, Plenty, Plunge, Pocketing, Pontiff, Pontifical, Pontius, Praetorian, Prissy, Professors, Protestants, Rasulullaah, Ratsach, Revving, Rosicrucians, Satan, Scrolls, Seder, Shakespeare, Syracuse, Tacitus, Theosophical, Torah, Trastevere, Turkish, USB, Uzi, VAIO, VCR, Yeah, Yechida, Yeetgadal, Yiddish, adrenalin, agita, airliner, airport, ankh, awesome, bitch, bomb, bookstores, braked, breastplate, briefcase, broadsword, broiler, brotherhood, bulrushes, cellular, checkpoint, chuckling, chutzpah, combatant, computer, dashboard, database, departmental, desktop, divorce, dysentery, electricity, enabling, entrepreneurs, firearms, firestorm, fishtailed, flagon, forensics, goatskin, groggily, gunfire, gunman, gunshots, handbag, handball, handbrake, handgun, helicopter, helmets, highwaymen, hijinks, homeland, homeless, homespun, hometown, innkeeper, internship, journalist, kebob, kidnappers, kilometers, lab, laptop, lyre, mawkish, monitor, muezzin, nickname, nightfall, nonbeliever, northeaster, notebook, notepad, notepaper, numerology, paganism, password, pastries, phone, photo, photocopies, photocopy, photograph, photos, pig, pigeons, pistol, playback, police, quintessentially, recycles, redialed, roadblock, roadway, sandwich, screensaver, site, sites, submachine, superheating, synagogue, taped, taxi, terrorism, terrorist, terrorists, thousandfold, thrashing, toga, tortured, trigonometry, universe, unto, vegetables, vehicles, video, videotape, vinegar, violence, warehouses, waterfall, welfare, 
wholeheartedly, whoosh, whore, windshield, worker, workstation, worldwide, yardstick, yarmulkes, yeetkadash, zooming

And if I run the comparison in the opposite direction, I come up with words such as

Abernethy, Abiding, Abimelech, Abyssinian, Academically, Academy, Accidents, Achillem, Adam, Adieu, Admirably, Age, Agency, Agents, Alas, Albert, Alighting, America, Atlantic, Australia, Balliol, Baron, Baronet, Britannia, Broadway, Brobdingnagian, Colonials, Cossacks, Crimea, Devon, Dewlap, Duchess, Duke, Dukedom, Earl, Edwardian, Egyptians, Elizabethan, Englishmen, Englishwoman, Europe, Holbein, Ireland, Iscariot, Isis, Japanese, Kaiser, Liberals, London, Madrid, Meistersinger, Messrs, Monsieur, Napoleon, Novalis, Papist, Parnassus, President, Prince, Professor, Prussians, Romanoff, Segregate, Slavery, Socrates, Switzerland, Tzar, Victoria, Wagnerian, Waterloo, Whithersoever, Zeus, absinthes, acolyte, adventures, affrights, affront, afire, afoot, aforesaid, aggravated, album, analogy, anarchy, ankle, ape, aright, aristocracy, ataraxy, automatically, avalanche, avow, balustrade, bandboxes, bank, beastliest, beau, beauteous, billiards, biography, bodyguard, bosky, boyish, broadcast, bruited, bulldog, businesslike, bustle, calorific, casuistry, catkins, chaperons, chidden, cigarettes, clergyman, cloven, comet, compeers, coquetry, cricket, crinolines, custard, dandiacal, dapperest, decanter, devil, dialogue, diet, dipsomaniacal, disemboldened, disinfatuate, drunken, ebullitions, equipage, exigent, eyelashes, eyelids, farthingales, female, femininity, fishwife, fob, forefather, forerunners, freemasonry, furbelows, gallimaufry, goodlier, gooseberry, gorgeous, gypsy, haberdasher, halfpence, handicapped, handicraft, handiwork, handwriting, hearthrug, helpless, hip, hireling, honeymoon, housemaid, housework, hoyden, hussy, idiotic, impertinent, impudence, inasmuch, incognisant, insipid, insolence, insouciance, item, keyboard, landau, legerdemain, loathsome, luck, maid, maidens, manhood, manumission, matador, maunderers, model, mushroom, nasty, newspaper, noodle, nosegay, novel, oarsmen, omnisubjugant, ostler, otiose, parasol, pinafore, poetry, poltroonery, postprandially, 
prank, prestidigitators, propinquity, queer, romance, sackcloth, salad, sardonic, saucy, schoolmaster, seraglio, sex, skimpy, skirt, snuff, socialistic, streetsters, surcease, surcoat, swooned, teens, telegram, telegraphs, thistledown, thither, thou, threepenny, tomboyish, toys, tradesmen, treacle, ugly, uncouthly, unvexed, vassalage, waylay, welter, wigwam, witchery, withal, woe, woebegone, womanly, womenfolk, wonderfully, wonderingly, wretchedness, wrought, yacht, yesternight, zounds


A2Z midterm: Vocabu-lame

vocabulap, slide 7

Apparently, I have learned absolutely nothing all semester, because what seemed like a very straightforward project proved to be completely beyond my abilities.

The overarching goal is to generate data for the visualization I’m making for Lisa Strausfeld and Christian Marc Schmidt’s Mainstreaming Information class. The following are some slides explaining the gist of the project, provisionally called Vocabulap (vocabulary + overlap; not a handsome coinage):

My specific goals for the A2Z midterm were as follows (with subsequent comments in all caps):

For A2Z midterm
* Remove all blank lines
* Remove all extra spaces
* Break all lines – DONE
* Rename all to number consecutively: A01, A02, . . . A10 (for old books); B01, B02, . . . B10 (for new books)

Compare major sets
* Extract the text from between the body tags in each file. Dump it out as a new file with the extension body.txt in the folder ../body.
* Concatenate all the files in each set.
* Make a list of unique words in each concatenated set, with the number of times the word appears.
* Strip out all words beginning with numerals.
* Create the following lists:
– Words shared by both major sets, with frequency counts
– Words unique to set 1, with frequency counts
– Words unique to set 2, with frequency counts
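The three lists for the major sets can be sketched with two frequency maps. This is a minimal illustration, not the midterm code: `VocabularySplit` and its fields are hypothetical names, the words are assumed to be lowercased already, and the count reported for a shared word is assumed to be the sum of its counts in both sets.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical container for the three lists; TreeMap keeps the words sorted.
class VocabularySplit {
    final Map<String, Integer> shared = new TreeMap<>();  // assumption: combined counts
    final Map<String, Integer> onlyInA = new TreeMap<>();
    final Map<String, Integer> onlyInB = new TreeMap<>();

    // Builds a word -> frequency map from a list of (already lowercased) words.
    static Map<String, Integer> count(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    // Sorts every word into exactly one of the three lists.
    static VocabularySplit compare(Map<String, Integer> a, Map<String, Integer> b) {
        VocabularySplit split = new VocabularySplit();
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer inB = b.get(e.getKey());
            if (inB == null) {
                split.onlyInA.put(e.getKey(), e.getValue());
            } else {
                split.shared.put(e.getKey(), e.getValue() + inB);
            }
        }
        for (Map.Entry<String, Integer> e : b.entrySet()) {
            if (!a.containsKey(e.getKey())) {
                split.onlyInB.put(e.getKey(), e.getValue());
            }
        }
        return split;
    }

    public static void main(String[] args) {
        Map<String, Integer> oldBooks = count(Arrays.asList("duke", "landau", "the", "the"));
        Map<String, Integer> newBooks = count(Arrays.asList("laptop", "the", "taxi"));
        VocabularySplit split = compare(oldBooks, newBooks);
        System.out.println("shared:   " + split.shared);
        System.out.println("only old: " + split.onlyInA);
        System.out.println("only new: " + split.onlyInB);
    }
}
```

Map lookups make each membership check a single step, so the cost grows with the number of distinct words rather than with the product of the two list lengths.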

Find unique words in each book
For each book:
* Concatenate all the files in that major set *except* the file for that book.
* Make a list of the unique words, with frequency counts, in
– the current book
– the set of all books except the current one
* Make three lists:
– Words shared by all books in the major set, with frequency counts
– Words that appear only in the current book, with frequency counts
– Words that appear only outside the current book, with frequency counts
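One way to get the per-book lists without concatenating "everything except this book" over and over is to count, for each word, how many books it appears in. Note that this swaps in a different technique from the plan above, and it is only a sketch: `PerBookUnique` is a made-up name, each book is assumed to be reduced to a set of lowercased words already, and frequency counts are left out to keep it short.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical names throughout; book labels follow the A01, A02 scheme above.
class PerBookUnique {

    // books maps a label such as "A01" to the set of distinct words in that book.
    static Map<String, Set<String>> uniquePerBook(Map<String, Set<String>> books) {
        // First pass: in how many books does each word occur?
        Map<String, Integer> bookCount = new HashMap<>();
        for (Set<String> words : books.values()) {
            for (String w : words) {
                bookCount.merge(w, 1, Integer::sum);
            }
        }
        // Second pass: a word is unique to a book if no other book contains it.
        Map<String, Set<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> e : books.entrySet()) {
            Set<String> only = new TreeSet<>();
            for (String w : e.getValue()) {
                if (bookCount.get(w) == 1) {
                    only.add(w);
                }
            }
            result.put(e.getKey(), only);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> books = new LinkedHashMap<>();
        books.put("A01", new TreeSet<>(Arrays.asList("duke", "landau", "the")));
        books.put("A02", new TreeSet<>(Arrays.asList("the", "yacht")));
        // prints {A01=[duke, landau], A02=[yacht]}
        System.out.println(uniquePerBook(books));
    }
}
```

The same `bookCount` map also answers the other two questions: a word shared by all books has a count equal to the number of books, and a word that appears only outside the current book is any counted word the current book's set lacks.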

Return lines surrounding specific words
For each word in a given list:
* Get the line numbers on which it appears.
For each appearance,
* Print the line above
* Print the line with the word, replacing it with itself wrapped in span tags to apply color
* Print the line below
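The line-context step can be sketched as follows. Everything named here is an assumption for illustration: the class `WordContext`, the method `context`, and the `hit` CSS class on the span are made up, and the lines are assumed to be plain text that needs no HTML escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the "hit" class name is a placeholder for whatever the CSS uses.
class WordContext {

    // For each line containing the word: the line above, the marked-up line, the line below.
    static List<String> context(List<String> lines, String word) {
        // \b keeps "duke" from matching inside "dukedom"; quote() guards regex metacharacters.
        Pattern p = Pattern.compile("\\b" + Pattern.quote(word) + "\\b", Pattern.CASE_INSENSITIVE);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = p.matcher(lines.get(i));
            if (!m.find()) {
                continue;
            }
            if (i > 0) {
                out.add(lines.get(i - 1));
            }
            // $0 is the matched text itself, so the original capitalization is preserved.
            out.add(m.replaceAll("<span class=\"hit\">$0</span>"));
            if (i + 1 < lines.size()) {
                out.add(lines.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("He bowed.", "The Duke smiled.", "She left.");
        for (String line : context(lines, "duke")) {
            System.out.println(line);
        }
    }
}
```

If two hits sit on neighboring lines, the shared neighbor is printed twice; whether that matters depends on how the output is displayed.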

The most essential piece of code that I could not get working is the comparison doodad. It almost worked for, like, five seconds, but it was generating a huge file of every unique word times however many words were in the document, or something like that. When I tried to fix it, it completely stopped working. The offending code is as follows:

/* 1. Takes in a file name from the command line.
2. Makes a string array out of the hard-coded comparison file.
3. Imports the contents of the file whose name was passed in.
4. For each line of the input file (i.e., each word), changes it to
lowercase and checks to see if it’s contained in the comparison file.
5. If it’s not in the comparison file, checks to see if it’s in a hashset of
unique words.
6. If the word’s not in the hashset, add it.
7. Print the contents of the hashset.
*/

import java.util.ArrayList;
import java.util.HashSet;
import com.decontextualize.a2z.TextFilter;

public class CompareUnique extends TextFilter {
  public static void main(String[] args) {
    new CompareUnique().run();
  } // end main

  private String filename = "body/unique/allB_uci.txt";
  private HashSet<String> uniqueWords = new HashSet<String>();
  private HashSet<String> lowercaseWords = new HashSet<String>();

  // make a String array out of the contents of the comparison file
  String[] checkAgainst = new TextFilter().collectLines(fromFile(filename));

  public void eachLine(String word) {
    String wordLower = word.toLowerCase();
    for (int i = 0; i < checkAgainst.length; i++) {
      // contains() here is a substring test on one comparison line at a time
      if (checkAgainst[i] != null && checkAgainst[i].contains(wordLower))
      {} // end if
      else if (checkAgainst != null) {
        if (lowercaseWords != null && lowercaseWords.contains(wordLower))
        { } // end if
        else if (lowercaseWords != null) {
          // body missing from the listing; reconstructed from step 6 above
          uniqueWords.add(word);
          lowercaseWords.add(wordLower);
        } // end else
      } // end else
    } // end for
  } // end eachLine

  public void end() {
    for (String reallyunique: uniqueWords) {
      // body missing from the listing; reconstructed from step 7 above
      println(reallyunique);
    } // end for
  } // end end

} // end class

I know, it seems very simple, but you have no idea how long it took me to get this far.
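One likely culprit in a loop like the one above is that the decision gets made once per line of the comparison file instead of once per word, and `String.contains` is a substring test, so "duke" counts as found inside "dukedom". A minimal sketch of the per-word version follows. It deliberately leaves out the TextFilter class so it can run on its own; `UniqueWords` and `notIn` are hypothetical names, and both inputs are assumed to hold one word per entry.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for the comparison step; not the a2z TextFilter API.
class UniqueWords {

    // Returns the lowercased candidates that never occur, as whole words, in the comparison list.
    static Set<String> notIn(List<String> candidates, List<String> comparison) {
        // Load the comparison words once; set lookups replace the inner loop.
        Set<String> known = new HashSet<>();
        for (String w : comparison) {
            known.add(w.trim().toLowerCase(Locale.ROOT));
        }
        // One yes-or-no decision per candidate word; the TreeSet dedupes and sorts.
        Set<String> unique = new TreeSet<>();
        for (String w : candidates) {
            String lower = w.trim().toLowerCase(Locale.ROOT);
            if (!lower.isEmpty() && !known.contains(lower)) {
                unique.add(lower);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<String> candidates = Arrays.asList("Duke", "laptop", "The", "laptop");
        List<String> comparison = Arrays.asList("the", "dukedom");
        // prints [duke, laptop]
        System.out.println(notIn(candidates, comparison));
    }
}
```

Here "duke" is reported as unique even though "dukedom" is in the comparison list, which is the whole-word behavior the project description asks for.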

So, basically, for the midterm I’ve got bupkis—just a big pile of text files, and a list of unique words for each.