Frequency analysis is a general term that refers to counting occurrences of certain values associated with a particular phenomenon. We can count all sorts of things like letters, words, links, emoticons, n-grams, and so on.
It's amazing what you can do simply by counting things and dividing by the number of things you have. Frequency analysis is useful across many domains, such as cryptography, security, and information retrieval. If you can do basic arithmetic and compute logarithms, you can handle frequency analysis.
We are going to look at character, word, and n-gram frequency analysis.
Here is English versus Spanish:
What do you notice?
Here is English sorted by character frequency (histogram):
Here is a comparison among a number of languages:
OK, so what? What can we do with this?
For one thing, we can use it to identify the language of a document, as long as the document is long enough to have statistically meaningful character frequencies. From a known large corpus, we compute the character frequencies a priori. Given a document, we count its character frequencies and then compare them to the corpus character frequencies of various languages. To determine similarity, we treat the character frequencies for a-z as a vector of length 26; the English character frequencies from the corpus form one such vector.
That defines a center of mass in a 26-dimensional hyperspace. Given a document, we compute the distance (using the so-called "L2 distance") to the centers of mass for each language. We declare the document as a member of the language to which the character frequency vector is closest.
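Here is a minimal sketch of that classifier, under my own assumptions: the function names are mine, and the per-language centroids would in practice come from large corpora (they are passed in as a dict here):

```python
import math
from collections import Counter

def char_freq_vector(text):
    """Return a length-26 vector of relative frequencies for a-z."""
    letters = [c for c in text.lower() if 'a' <= c <= 'z']
    counts = Counter(letters)
    n = len(letters)
    return [counts.get(chr(ord('a') + i), 0) / n for i in range(26)]

def l2_distance(v, w):
    """Euclidean (L2) distance between two frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

def classify(document, centroids):
    """Declare the document a member of the language whose
    centroid (center of mass) is closest in L2 distance."""
    v = char_freq_vector(document)
    return min(centroids, key=lambda lang: l2_distance(v, centroids[lang]))
```

The centroids themselves can be built with the same `char_freq_vector` function applied to a big corpus per language.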
We make the assumption that the character frequencies are independent of each other. That is not a very good assumption, but it still works really well.
The distances d1, d2, ... tell you how close you are to the centroid of a particular language; they can even give you a measure of confidence in your classification.
Here is some useful Python code to count character frequencies:
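A minimal sketch using `collections.Counter` (the function names `count_chars` and `print_freqs` are my own):

```python
from collections import Counter

def count_chars(text):
    """Count occurrences of each letter, ignoring case and
    non-letter characters."""
    return Counter(c for c in text.lower() if c.isalpha())

def print_freqs(text):
    """Print each letter with its relative frequency as a percentage."""
    counts = count_chars(text)
    total = sum(counts.values())
    for ch in sorted(counts):
        print(f"{ch} {100 * counts[ch] / total:.3f}%")
```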
To sort the character counts:
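One way to sort the counts, most frequent first (matching the histogram above), is `Counter.most_common`; `sorted_by_freq` is a name I'm introducing for illustration:

```python
from collections import Counter

def sorted_by_freq(text):
    """Return (character, count) pairs, most frequent first."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    return counts.most_common()
```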
Compare to the graphs above. Something seems off; e.g., 'b' is 0.44% versus the expected 1.492%. That's because the sample is small. Here's a bigger sample:
Using bi-grams instead of characters (1-grams) gives you an even stronger language identifier.
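Counting bi-grams is a one-line change from counting characters; a sketch (function name mine), which generalizes to any n:

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Count overlapping character n-grams (bi-grams by default),
    after stripping everything but letters."""
    letters = ''.join(c for c in text.lower() if c.isalpha())
    return Counter(letters[i:i + n] for i in range(len(letters) - n + 1))
```

The resulting counts can feed the same centroid-distance classifier, just with vectors of length 26^2 instead of 26.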
From Markus Dickinson, here are some 3-grams to compare English and Japanese:
Using characters and character bi-grams is an extremely effective means of cracking substitution ciphers. We will do a lab on this.
It's also the case that we might want to get rid of certain words, called stop words. Articles like "the" and even some helper verbs like "do" don't really impart any meaning. When you ask "how do i find my ip address" in Google, it ignores everything except "ip address". It turns out that Google is (most likely) not using stop words; it's probably using simple inverse document frequency. Still, this illustrates the point that stop words are a source of noise.
We can determine the stop words by word frequency in the collection (collection frequency). The most commonly used words, such as "the", will bubble to the top. For example, using a small Python program and a sample document from an FDA government website, I got a histogram that starts like this:
Here is Python code to compute word frequencies and show the most frequent words:
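A minimal sketch of such a program (the function name and the word regex are my own choices):

```python
import re
from collections import Counter

def word_freqs(text):
    """Return (word, count, relative frequency) triples,
    most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [(w, c, c / total) for w, c in counts.most_common()]
```

Printing one triple per line gives the kind of histogram shown below.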
I sorted the output of the Python program using the command line:
Notice how quickly the frequency drops off, and how small the frequency is even for the first element: "the" is only 3.5% of the words, despite being the most common by far. After 25 words or so, the frequencies are down by an order of magnitude. The other thing to point out is that there are plenty of words that are uncommon in general English, such as FDA, that are used very frequently in this document. You can't simply ignore the top 20 most frequently used words.
Manning: "The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists." Inverse document frequency is sufficient to push down these words in importance.
Next we turn to stylometry: the linguistic fingerprint. The most famous author-identification project deals with the Federalist Papers. From Markus Dickinson:
From Science news (I think).
Using bi-gram analysis for rare word pairs could make this even stronger. Some researchers think that the neural pathways in our brains predispose us to pair one word with another. For example, I knew a guy in school who never said "yes"; he always said "pretty much". Really annoying. We could also spot someone less than 30 years old by how often they say "you know". Ha!
From science news article: "most of the methods require the unknown text to contain at least 1,000 words".
Other possible applications: Given a corpus of company e-mails, can we identify an anonymous e-mail or document? This is relevant in lawsuits and such.
One of my goals when building the jguru.com forum manager years ago was to increase the signal-to-noise ratio. In a lot of forums you will see people get off topic and start talking about, say, movies in a Java forum. To reduce the noise, I wanted to filter out non-Java-related posts. But how do you know if something is talking about Java? You can't just look for keywords like "java", because people might be talking about coffee.
What I did was create a large English corpus by screen-scraping the New York Times website (this was years ago, before there were many such corpora on the web). Then I got all of the Java FAQ entries we had (5000 or so). Clearly the first corpus is pure English and the second corpus is English plus Java. I reasoned that by doing a fuzzy set difference, I could arrive at the Java vocabulary:
Java vocabulary = (Java + English) - English
I did word frequency analysis on both of the corpora. Some words are really popular across both corpora, such as the English articles "the" and "a". Some words are really popular in "Java speak", like Java, database, Swing, Object, window. Using what amounts to TFIDF (term frequency, inverse document frequency), I boosted all terms that were used frequently in the Java corpus but penalized those words if they were popular in both corpora. Then a simple threshold (which I eyeballed by looking at the data) identified everything above it as Java speak.
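A sketch of that boost-and-penalize idea, not the original code: score each word by its frequency in the domain corpus divided by its (smoothed) frequency in the general corpus, in log space, and keep everything above a threshold. The function name and the threshold value are assumptions of mine:

```python
import math
from collections import Counter

def domain_vocabulary(domain_words, general_words, threshold=3.0):
    """Approximate the 'fuzzy set difference': boost words frequent
    in the domain corpus, penalize words also frequent in the
    general-English corpus, then keep words scoring above threshold."""
    domain = Counter(domain_words)
    general = Counter(general_words)
    d_total = len(domain_words)
    g_total = len(general_words)
    scores = {}
    for w, c in domain.items():
        tf = c / d_total
        # add-one smoothing so words absent from the general
        # corpus don't cause a divide-by-zero
        penalty = (general[w] + 1) / (g_total + 1)
        scores[w] = math.log(tf / penalty)
    return {w for w, s in scores.items() if s >= threshold}
```

Common words like "the" appear in both corpora, so their ratio is near 1 and their log score near 0; domain words score high because the denominator stays tiny.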
Then, to prevent people from talking about databases in the IO forum, I did a similar analysis that discovered the lexicon of each forum topic. Again, I used the human-edited and groomed FAQ entries as training data because I knew precisely what the topics were.
This approach would also work for distinguishing between different kinds of novels, if we had access to all of Amazon's data, for example. We could find the lexicon of love stories versus science fiction. Of course, we would get better results if we did bi-gram analysis instead of just word analysis.