
Frequency Analysis

Frequency analysis is a general term that refers to counting occurrences of certain values associated with a particular phenomenon. We can count all sorts of things like letters, words, links, emoticons, n-grams, and so on.

It's amazing what you can do simply by counting things and dividing by the number of things you have. Frequency analysis is useful across many domains such as cryptography, security, and information retrieval. For example,

  • as we will see in a lab, computing the character frequency of an alphabet can be used to decrypt substitution ciphers
  • word frequency can be used for speaker identification, language identification, group membership, ...
  • n-grams are just "meta-words" that refer to token sequences of length n. While more expensive to store and compute, they can provide more accuracy.
  • we can analyze log files, looking for usage trends of our website or even attacks and unauthorized access (counting occurrences of referring websites, incoming IP addresses, and so on)

If you can do basic arithmetic and compute logarithms, you can handle frequency analysis.

We are going to look at character, word, and n-gram frequency analysis.

Character frequencies

Some written language character frequencies

Here is English versus Spanish:

What do you notice?

Here is English sorted by character frequency (histogram):

Here is a comparison among a number of languages:

OK, so what? What can we do with this?

Language identification

For one thing, we can use it to identify the language of a document, as long as the document is long enough to have statistically meaningful character frequencies. From a known large corpus, we compute the character frequencies a priori. Given a document, we count its character frequencies and then compare the document's character frequencies to the corpus character frequencies of the various languages. To determine similarity, we treat the character frequencies a-z as vectors of length 26. For example, we might have English character frequencies as:

a	8.167%
b	1.492%
c	2.782%
d	4.253%
e	12.702%
f	2.228%
...

That defines a center of mass in a 26-dimensional hyperspace. Given a document, we compute the distance (using the so-called "L2 distance") to the centers of mass for each language. We declare the document as a member of the language to which the character frequency vector is closest.

We assume that the character frequencies are independent of each other; that is not a very good assumption, but it still works really well.


The distances d1, d2 to the language centroids tell you how close you are to each language. They can even tell you how confident you are in your classification: a big gap between the closest and second-closest centroid means high confidence.
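
To make the idea concrete, here is a minimal sketch of nearest-centroid classification using the L2 distance. The toy sample strings and helper names (char_freqs, l2_distance, classify) are made up for illustration; real centroids would be computed from large corpora of each language:

import math
from collections import defaultdict

def char_freqs(text):
    """Map each letter to its percentage frequency in text (letters only)."""
    letters = [c for c in text.lower() if c.isalpha()]
    counts = defaultdict(int)
    for c in letters:
        counts[c] += 1
    n = float(len(letters))
    return dict((c, 100.0 * counts[c] / n) for c in counts)

def l2_distance(p, q):
    """Euclidean (L2) distance between two frequency dicts; missing letters count as 0."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))

def classify(text, centroids):
    """Return (closest language, d1, d2): the nearest language and the two smallest distances."""
    f = char_freqs(text)
    dists = sorted((l2_distance(f, c), lang) for lang, c in centroids.items())
    (d1, lang), (d2, _) = dists[0], dists[1]
    return lang, d1, d2

# Toy "centroids" built from tiny samples purely for illustration; real centroids
# come from character frequencies over large corpora of each language.
centroids = {
    'english': char_freqs("the quick brown fox jumps over the lazy dog again and again"),
    'spanish': char_freqs("el veloz zorro marron salta sobre el perro perezoso una y otra vez"),
}
lang, d1, d2 = classify("the dog and the fox sleep over there", centroids)
print "%s d1=%2.2f d2=%2.2f" % (lang, d1, d2)  # a big gap between d1 and d2 means high confidence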

Useful Python code to count character frequencies:

from collections import defaultdict
import string
import sys

text = sys.stdin.read()
# delete punctuation, digits, newlines, and spaces; keep only letters
text = text.translate(None, string.punctuation + "0123456789" + '\n' + ' ')
text = text.lower()

cfreq = defaultdict(int)  # character -> count
for c in text:
    cfreq[c] += 1

n = len(text)
for c in cfreq:
    print "%d %2.2f %s" % (cfreq[c], float(cfreq[c])/n*100, c)

Running it with the "center of mass" paragraph above pasted in as sample input (end the input with control-D):

$ export LC_ALL='C'  # use the plain "C" locale so sort handles non-ASCII characters consistently
$ python charfreq.py | sort -r -n
That defines a center of mass in a 26-dimensional hyperspace. Given a document, we
 compute the distance (using the so-called "L2 distance") to the centers of mass 
for each language. We declare the document as a member of the language to which 
the character frequency vector is closest.
36 16.00 e
21 9.33 a
20 8.89 t
17 7.56 c
16 7.11 s
15 6.67 n
13 5.78 o
12 5.33 h
10 4.44 r
10 4.44 i
8 3.56 m
8 3.56 l
8 3.56 d
7 3.11 u
6 2.67 g
6 2.67 f
3 1.33 w
3 1.33 p
2 0.89 y
2 0.89 v
1 0.44 q
1 0.44 b

To sort by character instead (same sample input):

$ python charfreq.py | awk '{print $3, $2}' | sort
That defines a center of mass in a 26-dimensional hyperspace. Given a document, we
 compute the distance (using the so-called "L2 distance") to the centers of mass 
for each language. We declare the document as a member of the language to which 
the character frequency vector is closest.
a 9.33
b 0.44
c 7.56
d 3.56
e 16.00
f 2.67
g 2.67
h 5.33
i 4.44
l 3.56
m 3.56
n 6.67
o 5.78
p 1.33
q 0.44
r 4.44
s 7.11
t 8.89
u 3.11
v 0.89
w 1.33
y 0.89

Compare to the graphs above. The numbers seem off; e.g., 'b' is 0.44% versus the expected 1.492%. That's just a small-sample effect. Here is a bigger sample:

$ wc generic-drugs.txt 
     688    3739   37818 generic-drugs.txt
$ cat generic-drugs.txt | python charfreq.py | awk '{print $3, $2}'|sort
a 7.27
b 1.36
c 4.94
d 5.88
...

Bi-gram analysis

Using bi-grams instead of characters (1-grams) gives you an even stronger language identifier.
Here are some English bi-gram frequencies from the Cornell University Math Explorer's Project:

th 1.52%       en 0.55%       ng 0.18%
he 1.28%       ed 0.53%       of 0.16%
in 0.94%       to 0.52%       al 0.09%
er 0.94%       it 0.50%       de 0.09%
an 0.82%       ou 0.50%       se 0.08%
re 0.68%       ea 0.47%       le 0.08%
nd 0.63%       hi 0.46%       sa 0.06%
at 0.59%       is 0.46%       si 0.05%
on 0.57%       or 0.43%       ar 0.04%
nt 0.56%       ti 0.34%       ve 0.04%
ha 0.56%       as 0.33%       ra 0.04%
es 0.56%       te 0.27%       ld 0.02%
st 0.55%       et 0.19%       ur 0.02%
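
Counting bi-grams is only a small change from counting single characters. Here is a minimal sketch (the function name bigram_freqs is my own, and this simple version ignores word boundaries):

from collections import defaultdict

def bigram_freqs(text):
    """Map each adjacent letter pair to its percentage frequency."""
    letters = [c for c in text.lower() if c.isalpha()]
    if len(letters) < 2:
        return {}
    counts = defaultdict(int)
    for i in range(len(letters) - 1):
        counts[letters[i] + letters[i + 1]] += 1
    n = float(len(letters) - 1)
    return dict((bg, 100.0 * counts[bg] / n) for bg in counts)

freqs = bigram_freqs("the theory of the thing is the thing itself")
for bg in sorted(freqs, key=freqs.get, reverse=True):
    print "%2.2f %s" % (freqs[bg], bg)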

From Markus Dickinson, here are some 3-grams to compare English and Japanese:

n-gram English Japanese
aba 12 54
ace 95 10
act 45 1
arc 8 0

Cracking substitution ciphers

Using characters and character bi-grams is an extremely effective means of cracking substitution ciphers. We will do a lab on this.
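
The basic idea (details in the lab): rank the ciphertext letters by frequency, rank English letters by their known frequency, and guess that the i-th most common ciphertext letter maps to the i-th most common English letter; bi-gram frequencies then help fix the letters the simple ranking gets wrong. Here is a rough sketch of that first step; cipher.txt is a hypothetical input file:

from collections import defaultdict

# One common ordering of English letters from most to least frequent
english_by_freq = "etaoinshrdlcumwfgypbvkjxqz"

def guess_key(ciphertext):
    """Guess a substitution key by aligning ciphertext letter ranks with English letter ranks."""
    counts = defaultdict(int)
    for c in ciphertext.lower():
        if c.isalpha():
            counts[c] += 1
    cipher_by_freq = sorted(counts, key=counts.get, reverse=True)
    return dict(zip(cipher_by_freq, english_by_freq))

def decrypt(ciphertext, key):
    """Apply the guessed mapping; unmapped characters pass through unchanged."""
    return ''.join(key.get(c, c) for c in ciphertext.lower())

ciphertext = open('cipher.txt').read()  # hypothetical ciphertext file
key = guess_key(ciphertext)
print decrypt(ciphertext, key)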

Stop words

It's also the case that we might want to get rid of certain words called stop words. Articles like "the" and even some helper verbs like "do" don't really impart any meaning. When you ask "how do i find my ip address" in Google, it ignores everything except for "ip address". It turns out that Google is most likely not using stop words--it's probably using simple inverse document frequency--but this illustrates the point that stop words are a source of noise.

Determine the stop words by word frequency in the collection (collection frequency). The words that are used most commonly, such as "the", will bubble to the top. For example, using the little Python program below and a sample document from an FDA government website, I got a histogram that starts like this:

0.035615 the
0.026604 fda
0.025316 drug
0.024029 of
0.021455 and
0.018022 to
0.018022 generic
0.017807 gov
0.016735 http
0.015662 files
0.015447 nps
0.015447 https
0.015447 govdocs
0.015447 edu
0.015447 domex
0.015447 corp
0.015018 www
0.012658 htm
0.011800 pdf
0.011800 for
0.010942 html
0.010727 drugs
0.010727 cder
0.007509 in
...

Here is the Python code that computes the word frequencies:

import string
import sys

# Given a list of words, return a list of pairs: (word frequency, word)
def histo(words): # compute term frequency (TF)
    counts = {}
    nwords = len(words)
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return [(counts[w]/float(nwords), w) for w in counts]

# MAIN

text = sys.stdin.read()
# translation table that maps punctuation and digits to spaces
trans = string.maketrans(string.punctuation+"0123456789",
                         " " * (len(string.punctuation)+10))
text = text.translate(trans)           # strip out punctuation, digits
text = text.lower()
words = text.replace('\n', ' ').split()
words = [w for w in words if len(w)>1] # ignore I, A, etc...

wordFreq = histo(words) # compute frequency of each word in the input text

for w in wordFreq:
    tf = w[0]
    term = w[1]
    print "%f %s" % (tf, term)

I sorted the output of the Python program using the command line:

$ python wordfreq.py < generic-drugs.txt | sort -r -n

Notice how quickly the frequency drops off and how small the frequency is even for the first element: "the" is only 3.5% of the words, despite being the most common by far. After 25 words or so, the frequencies are down by an order of magnitude. The other thing to point out is that there are plenty of words, such as "fda", that are uncommon in general English but used very frequently here. You can't simply ignore the top 20 most frequently used words.

Manning et al. (Introduction to Information Retrieval): "The general trend in IR systems over time has been from standard use of quite large stop lists (200-300 terms) to very small stop lists (7-12 terms) to no stop list whatsoever. Web search engines generally do not use stop lists." Inverse document frequency is sufficient to push down these words in importance.
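
Inverse document frequency itself is easy to compute: idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. Here is a minimal sketch; the helper name idf and the toy document lists are mine, purely for illustration:

from math import log

def idf(term, documents):
    """Inverse document frequency: log(N / df), where df is the number of
    documents (word lists) containing the term."""
    N = len(documents)
    df = sum(1 for doc in documents if term in doc)
    return log(float(N) / df) if df > 0 else 0.0

# Toy collection purely for illustration
docs = [["the", "fda", "approved", "the", "drug"],
        ["the", "drug", "is", "generic"],
        ["patients", "take", "the", "drug", "daily"]]
print "%f %s" % (idf("the", docs), "the")  # appears in every document -> idf 0
print "%f %s" % (idf("fda", docs), "fda")  # rarer across documents -> higher idf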

Author identification

We use Stylometry: the linguistic fingerprint. The most famous author identification project deals with the Federalist papers. From Markus Dickinson:

  • The Federalist Papers were a series of 85 articles written between 1787 and 1788 by James Madison, Alexander Hamilton and John Jay to persuade New York to ratify the Constitution.
  • Some of the papers were clearly written by one of the three; 12 are in question, written either by Hamilton or Madison.
  • Mosteller and Wallace (1964) examined the frequency of various words in the disputed papers and compared each to a model of known Hamilton writings and known Madison writings.

From a Science News article (I think):

  • At first, we might think it makes sense to look at the rarest words in an author's writing in order to train our classifier.
  • That makes sense because "the" and "to" are used so much across all English writing that they seem useless for classification (document frequency inversely identifies good classification features).
  • It turns out, however, that rare words aren't what we want. People often copy interesting and noticeable words, but they can't copy my style of using "the", "and", and so on.
  • Hamilton used the word "upon" about 10 times as often as Madison did, for example. A small sketch of this kind of comparison follows the list.
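
Here is a minimal sketch of that kind of comparison: compute the rate per 1,000 words of a few function words for each known author and for the disputed text, then see whose profile the disputed text is closer to. The file names and helper functions are hypothetical, and Mosteller and Wallace's actual method (a Bayesian analysis over many function words) was far more sophisticated:

import math

def word_rates(text, markers):
    """Rate per 1,000 words of each marker word in text."""
    words = text.lower().split()
    n = float(len(words))
    return dict((m, 1000.0 * words.count(m) / n) for m in markers)

def distance(p, q):
    """Euclidean distance between two marker-word rate profiles."""
    return math.sqrt(sum((p[m] - q[m]) ** 2 for m in p))

# Example marker words; Mosteller and Wallace used many such function words
markers = ["upon", "while", "whilst", "enough"]

# Hypothetical files of known and disputed writings (placeholders, not real data)
hamilton = word_rates(open('hamilton.txt').read(), markers)
madison  = word_rates(open('madison.txt').read(), markers)
disputed = word_rates(open('disputed.txt').read(), markers)

if distance(disputed, hamilton) < distance(disputed, madison):
    print "looks like Hamilton"
else:
    print "looks like Madison"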

Using bi-gram analysis for rare word pairs could make this even stronger. Some researchers think that the neural pathways in our brains are such that we are predisposed to pair one word with another. For example, I knew a guy in school who never said "yes"; he always said "pretty much". Really annoying. We could also identify someone less than 30 years old by how often they say "you know". Ha!

From the Science News article: "most of the methods require the unknown text to contain at least 1,000 words".

Other possible applications: given a corpus of company e-mails, can we identify the author of an anonymous e-mail or document? This is relevant in lawsuits and such.

Vocabulary discovery

One of my goals when building the jguru.com forum manager years ago was to increase the signal-to-noise ratio. In a lot of forums you will see people get off topic and start talking about, say, movies in a Java forum. In order to reduce the noise, I wanted to filter out non-Java-related posts, but how do you know whether something is talking about Java? You can't just look for keywords like "java" because people might be talking about coffee.

What I did was to create a large English corpus by screen scraping the New York Times website (this was years ago, before there were many such corpora on the web). Then I took all of the Java FAQ entries we had (5,000 or so). Clearly the first corpus is pure English and the second corpus is English plus Java. I reasoned that by doing a fuzzy set difference, I could arrive at the Java vocabulary:

java vocabulary = (java+English) - English

I did word frequency analysis on both of the corpora. Some words are really popular across both corpora, such as the English articles "the" and "a". Some words are really popular in "Java speak", like Java, database, Swing, Object, window. Using what amounts to TFIDF (term frequency, inverse document frequency), I boosted terms that were used frequently in the Java corpus but penalized words that were popular in both corpora. Then a simple threshold (which I eyeballed by looking at the data) identified everything above it as Java speak.
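
Here is a minimal sketch of that kind of scoring, treating the two corpora as two "documents" for the inverse-document-frequency part. The variable names, the toy word lists, and the threshold value are made up for illustration; this is not the original jGuru code:

from math import log

def term_freqs(words):
    """Map each word to its frequency (count / total words)."""
    counts = {}
    for w in words:
        counts[w] = counts.get(w, 0) + 1
    n = float(len(words))
    return dict((w, counts[w] / n) for w in counts)

# Toy stand-ins for the two corpora (NYT scrape vs. Java FAQ entries)
english = "the court ruled that the law applies to the state".split()
java    = "the object calls the database through the swing window object".split()

eng_tf, java_tf = term_freqs(english), term_freqs(java)

# Score each Java-corpus term: term frequency in the Java corpus times a
# two-"document" idf. With only two corpora, log(2/df) zeroes out words that
# appear in both; a smoothed variant would merely penalize them.
scores = {}
for w, tf in java_tf.items():
    df = 2 if w in eng_tf else 1
    scores[w] = tf * log(2.0 / df)

threshold = 0.05  # made-up threshold; the real one was eyeballed from the data
java_speak = sorted(w for w in scores if scores[w] > threshold)
print java_speak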

Then, to prevent people from talking about databases in the I/O forum, I did a similar analysis that discovered the lexicon of each of the various forum topics. Again, I used the human-edited and groomed FAQ entries as training data because I knew precisely what the topics were.

This approach would also work for distinguishing between different kinds of novels if we had access to all of Amazon's data, for example. We could find the lexicon of love stories versus science fiction. Of course, we would get better results if we did bi-gram analysis instead of just word analysis.
