{"id":551,"date":"2009-03-10T06:30:44","date_gmt":"2009-03-10T11:30:44","guid":{"rendered":"http:\/\/itp.indiamos.com\/blog\/?p=551"},"modified":"2017-08-03T19:07:09","modified_gmt":"2017-08-04T00:07:09","slug":"a2z-midterm-vocabu-lame","status":"publish","type":"post","link":"https:\/\/itp.indiamos.com\/blog\/2009\/03\/10\/a2z-midterm-vocabu-lame\/","title":{"rendered":"A2Z midterm: Vocabu-lame"},"content":{"rendered":"<p><a href='https:\/\/www.slideshare.net\/indiamos\/vocabulap-200903'><img loading=\"lazy\" src=\"https:\/\/i1.wp.com\/itpindia.wordpress.com\/files\/2009\/03\/vocabulap_7.png?resize=450%2C338\" alt=\"vocabulap, slide 7\" title=\"vocabulap, slide 7\" width=\"450\" height=\"338\" class=\"alignnone size-full wp-image-31\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>Apparently, I have learned absolutely nothing all semester, because what seemed like a very straightforward project proved to be completely beyond my abilities.<\/p>\n<p>The overarching goal is to generate data for the visualization I&#8217;m making for Lisa Strausfeld and Christian Marc Schmidt&#8217;s <a href=\"https:\/\/web.archive.org\/web\/20110726034855\/http:\/\/www.christianmarcschmidt.com\/NYU2009\/index.html\">Mainstreaming Information<\/a> class. The following are some <a href=\"https:\/\/www.slideshare.net\/indiamos\/vocabulap-200903\">slides<\/a> explaining the gist of the project, provisionally called Vocabulap (<em>vocabulary<\/em> + <em>overlap<\/em>; not a handsome coinage):<\/p>\n<iframe src='https:\/\/www.slideshare.net\/slideshow\/embed_code\/78542486' width='474' height='389' sandbox=\"allow-popups allow-scripts allow-same-origin allow-presentation\" allowfullscreen webkitallowfullscreen mozallowfullscreen><\/iframe>\n<p>My specific goals for the A2Z midterm were as follows (with subsequent comments in all caps):<\/p>\n<blockquote><p>For A2Z midterm<br \/>\n===============<br \/>\nPrep<br \/>\n&#8212;-<br \/>\n* Remove all blank lines<br \/>\n    DONE<br \/>\n* Remove all extra spaces<br \/>\n    DONE<br \/>\n* Break all lines &#8211; DONE<br \/>\n* Rename all to number consecutively: A01, A02, . . . A10 (for old books); B01, B02, . . . B10 (for new books)<br \/>\n    DONE<\/p>\n<p>Compare major sets<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br \/>\n* Extract the text from between the body tags in each file. Dump it out as a new file with the extension body.txt in the folder ..\/body.<br \/>\n    THIS IS HARDER THAN IT LOOKS (FOR ME, AT LEAST). EASIER TO JUST CUT THEM OFF BY HAND.<br \/>\n* Concatenate all the files in each set.<br \/>\n    DID THIS FROM THE COMMAND LINE, USING CAT<br \/>\n* Make a list of unique words in each concatenated set, with the number of times the word appears.<br \/>\n    CAN GET THE UNIQUE WORDS, BUT NOT THE COUNT.<br \/>\n* Strip out all words beginning with numerals.<br \/>\n    DONE BY HAND<br \/>\n* Create the following lists:<br \/>\n    &#8211; Words shared by both major sets, with frequency counts<br \/>\n    &#8211; Words unique to set 1, with frequency counts<br \/>\n    &#8211; words unique to set 2, with frequency counts<br \/>\n    I APPARENTLY CANNOT DO ANY OF THIS.<\/p>\n<p>Find unique words in each book<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br \/>\nFor each book:<br \/>\n* Concatenate all the files in that major set *except* the file for that book.<br \/>\n* Make a list of the unique words, with frequency counts, in<br \/>\n    &#8211; the current book<br \/>\n    &#8211; the set of all books except the current one<br \/>\n* Make three lists:<br \/>\n    &#8211; Words shared by all books in the major set, with frequency counts<br \/>\n    &#8211; Words that appear only in the current book, with frequency counts<br \/>\n    &#8211; Words that appear only outside the current book, with frequency counts<\/p>\n<p>Return lines surrounding specific words<br \/>\n&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;&#8212;<br \/>\nFor each word in a given list:<br \/>\n* Get the line numbers on which it appears.<br \/>\nFor each appearance,<br \/>\n* Print the line above<br \/>\n* Print the line with the word, replacing it with itself wrapped in span tags to apply color<br \/>\n* Print the line below<\/p><\/blockquote>\n<p>The most essential piece of code that I could not get working is the comparison doodad. It almost worked for, like, five seconds, but it was generating a huge file of every unique word times however many words were in the document, or something like that. When I tried to fix it, it completely stopped working. The offending code is as follows:<\/p>\n<p>[java]<br \/>\n\/*  1. Takes in a file name from the command line.<br \/>\n    2. Makes a string array out of the hard-coded comparison file.<br \/>\n    3. Imports the contents of the file whose name was passed in.<br \/>\n    4. For each line of the input file (i.e., each word), changes it to<br \/>\n       lowercase and checks to see if it&#8217;s contained in the comparison file.<br \/>\n    5. If it&#8217;s not in the comparison file, checks to see if it&#8217;s in a hashset of<br \/>\n       unique words.<br \/>\n    6. If the word&#8217;s not in the hashset, add it.<br \/>\n    7. Print the contents of the hashset.<br \/>\n*\/<\/p>\n<p>import java.util.ArrayList;<br \/>\nimport java.util.HashSet;<br \/>\nimport com.decontextualize.a2z.TextFilter;<\/p>\n<p>public class CompareUnique extends TextFilter<br \/>\n{<br \/>\n    public static void main(String[] args)<br \/>\n    {<br \/>\n        new CompareUnique().run();<br \/>\n    } \/\/ end main<\/p>\n<p>    private String filename = &#8220;body\/unique\/allB_uci.txt&#8221;;<br \/>\n    private HashSet uniqueWords = new HashSet();<br \/>\n    private HashSet lowercaseWords = new HashSet();<\/p>\n<p>    \/\/ make a String array out of the contents of the comparison file<br \/>\n    String[] checkAgainst = new TextFilter().collectLines(fromFile(filename));<\/p>\n<p>  public void eachLine(String word)<br \/>\n  {<br \/>\n    String wordLower = word.toLowerCase();<br \/>\n    for (int i = 0; i &lt; checkAgainst.length; i++)<br \/>\n    {<br \/>\n        if (checkAgainst[i] != null &amp;&amp; checkAgainst[i].contains(wordLower))<br \/>\n\t\t{} \/\/ end if<br \/>\n\t\telse if (checkAgainst != null)<br \/>\n\t\t{<br \/>\n            if (lowercaseWords != null &amp;&amp; lowercaseWords.contains(wordLower))<br \/>\n            { } \/\/ end if<br \/>\n            else if (lowercaseWords != null)<br \/>\n            {<br \/>\n                uniqueWords.add(wordLower);<br \/>\n                lowercaseWords.add(wordLower);<br \/>\n            } \/\/ end else<br \/>\n \t\t} \/\/ end else<br \/>\n    } \/\/ end for<br \/>\n  } \/\/ end eachLine<\/p>\n<p>  public void end()<br \/>\n  {<br \/>\n    for (String reallyunique: uniqueWords) {<br \/>\n      println(reallyunique);<br \/>\n    } \/\/ end for<br \/>\n  } \/\/ end end<\/p>\n<p>} \/\/ end class<br \/>\n[\/java]<\/p>\n<p><em>I know, it seems very simple, but you have no idea how long it took me to get this far.<\/em><\/p>\n<p>So, basically, for the midterm I&#8217;ve got bupkis\u2014just a big pile of text files, and a list of unique words for each.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Apparently, I have learned absolutely nothing all semester, because what seemed like a very straightforward project proved to be completely beyond my abilities. The overarching goal is to generate data for the visualization I&#8217;m making for Lisa Strausfeld and Christian Marc Schmidt&#8217;s Mainstreaming Information class. The following are some slides explaining the gist of the &hellip; <a href=\"https:\/\/itp.indiamos.com\/blog\/2009\/03\/10\/a2z-midterm-vocabu-lame\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">A2Z midterm: Vocabu-lame<\/span> <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":6,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[26,46,35,36,16,10],"tags":[],"jetpack_featured_media_url":"","jetpack_publicize_connections":[],"jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p3qY10-8T","_links":{"self":[{"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/posts\/551"}],"collection":[{"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/users\/6"}],"replies":[{"embeddable":true,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/comments?post=551"}],"version-history":[{"count":10,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/posts\/551\/revisions"}],"predecessor-version":[{"id":930,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/posts\/551\/revisions\/930"}],"wp:attachment":[{"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/media?parent=551"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/categories?post=551"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/itp.indiamos.com\/blog\/wp-json\/wp\/v2\/tags?post=551"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}