Notes partially derived from An Introduction to Information Retrieval by Manning et al.
A token in a computer language is very well-defined. For example, "x=y;" has four tokens: x, =, y, and ;, but only three token types (categories): ID, =, and ;, because x and y are both IDs. We're all familiar with integers, keywords, variables, and so on; we call those token types. The same thing happens with English text. The phrase "To go to sleep." has four tokens but only three token types, because "to" appears twice. We can also discard the period at the end because we are only interested in the words. If we were looking for emoticons, though, punctuation would be very important. That's also true if we're looking for links.
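To make the token/type distinction concrete, here is a minimal sketch. The `tokenize` helper is an assumption of mine (split on nonalphanumeric characters and lowercase), not a function from the notes:

```python
import re

def tokenize(text):
    # lowercase, then split on runs of nonalphanumeric characters,
    # discarding the separators (so the trailing period vanishes)
    return [t for t in re.split(r"[^a-zA-Z0-9]+", text.lower()) if t]

tokens = tokenize("To go to sleep.")
print(tokens)            # ['to', 'go', 'to', 'sleep']
print(len(tokens))       # 4 tokens
print(len(set(tokens)))  # 3 token types ("to" is repeated)
```

The set of distinct tokens gives the token types, exactly as in the "To go to sleep." example above.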
Dealing with contractions and the possessive "apostrophe s" in English can be a pain. Manning poses the following questions:
Which tokenization should we use for O’Neill? neill, oneill, o’neill, o’ neill, o neill?
What about for aren't? aren’t, arent, are n’t, aren t?
The easiest thing to do is simply to split on nonalphanumeric characters (and throw them out), but that mangles contractions, turning "aren't" into "aren t".
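A quick sketch of that naive splitter, showing exactly how it mishandles the examples above:

```python
import re

def naive_tokenize(text):
    # split on any run of nonalphanumeric characters and discard them
    return [t for t in re.split(r"[^a-zA-Z0-9]+", text) if t]

print(naive_tokenize("aren't"))   # ['aren', 't']
print(naive_tokenize("O'Neill"))  # ['O', 'Neill']
```

The apostrophe is treated like any other separator, which is why the contraction falls apart.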
We also have to be careful about domain specific tokens that use punctuation such as Manning's examples: "C++ and C#, aircraft names like B-52, or a T.V. show name such as M*A*S*H".
Links also require special tokenization: http://www.antlr.org, email@example.com
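One way to handle links is to pull URLs and email addresses out as single tokens before applying the general splitter. The two regular expressions below are simplistic assumptions for illustration; real tokenizers use far more robust patterns:

```python
import re

# hypothetical, deliberately simple patterns
URL_RE   = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def tokenize_with_links(text):
    tokens = []
    # the capturing group makes re.split keep the matched URLs/emails
    pattern = re.compile(f"({URL_RE.pattern}|{EMAIL_RE.pattern})")
    for piece in pattern.split(text):
        if pattern.fullmatch(piece):
            tokens.append(piece)  # keep the link intact as one token
        else:
            tokens.extend(t for t in re.split(r"[^a-zA-Z0-9]+", piece) if t)
    return tokens

print(tokenize_with_links("See http://www.antlr.org or mail email@example.com today"))
# ['See', 'http://www.antlr.org', 'or', 'mail', 'email@example.com', 'today']
```

Without the pre-extraction step, the URL would shatter into http, www, antlr, and org.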
Hyphenation: co-education, hyphenated names like Berners-Lee, compounds like domain-specific language. It's not obvious what to do in all cases. We can either use statistical methods obtained from a corpus or use a simple heuristic that does not split apart short words with hyphens, such as co-education, but does split apart things like domain-specific.
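The simple heuristic can be sketched as follows; the length threshold of 4 is an assumption I'm making for illustration, not a value from the notes:

```python
def split_hyphenated(word, max_keep=4):
    # heuristic: keep the hyphenated word whole if any part is a short
    # fragment (co-education, Berners-Lee); otherwise split it apart
    parts = word.split("-")
    if "-" in word and all(len(p) > max_keep for p in parts):
        return parts
    return [word]

print(split_hyphenated("co-education"))     # ['co-education']
print(split_hyphenated("Berners-Lee"))      # ['Berners-Lee']
print(split_hyphenated("domain-specific"))  # ['domain', 'specific']
```

Tuning the threshold (or falling back to corpus statistics) is where the real work lies.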
Manning also points out that sometimes it's not appropriate to split on whitespace, particularly for city names like San Francisco, phone numbers like (800) 234-2333, or dates. That is related to bi-gram analysis, where we look at two-word sequences.
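Extracting bi-grams from a token stream is a one-liner: pair each token with its successor. A quick sketch:

```python
def bigrams(tokens):
    # zip the token list against itself shifted by one
    return list(zip(tokens, tokens[1:]))

print(bigrams(["I", "love", "San", "Francisco"]))
# [('I', 'love'), ('love', 'San'), ('San', 'Francisco')]
```

Counting how often a pair like ('San', 'Francisco') occurs together, relative to its parts, is one way to decide that the pair should be kept as a single unit.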
We want words like anti-discriminatory and antidiscriminatory to be treated the same. That is, we want them in an equivalence class called, perhaps, antidiscriminatory. Searches for one will find the other. By removing punctuation, we implicitly create equivalence classes and choose one of the words as the name of the equivalence class.
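A minimal sketch of this normalization, assuming we drop punctuation and fold case (case folding is discussed further below):

```python
def normalize(term):
    # the equivalence class name: lowercase, alphanumerics only
    return "".join(ch for ch in term.lower() if ch.isalnum())

print(normalize("anti-discriminatory"))  # antidiscriminatory
print(normalize("anti-discriminatory") == normalize("antidiscriminatory"))  # True
```

Applying `normalize` both at indexing time and at query time puts both spellings in the same class.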
We can also compute or specify synonyms such as car and automobile.
How do we deal with Windows, the operating system, versus windows as in "I open the windows"? If we mean just the generic term, then window/windows should be the same thing. In my experience, stripping plurals makes a big difference, but it can often go wrong. For example, look above at the list of word frequencies: you will note both http and https, and naively stripping a trailing s would wrongly merge them.
It's a good idea to normalize the case as well so http and HTTP are in the same equivalence class.
To strip plurals, we can write a function that first skips words that are too short by some measure, then tests for special cases from which we should not strip s: has, was, does, goes, dies, yes, get, its. Then we apply English-specific plurality rules.
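The steps above can be sketched as follows; the length threshold and the particular rule set (-ies, -sses/-xes, plain -s) are assumptions of mine, not the exact rules from the notes:

```python
SPECIAL = {"has", "was", "does", "goes", "dies", "yes", "get", "its"}

def strip_plural(word, min_len=4):
    # step 1: leave short words and known special cases alone
    if len(word) < min_len or word in SPECIAL:
        return word
    # step 2: assumed English plurality rules, most specific first
    if word.endswith("ies"):
        return word[:-3] + "y"   # bodies -> body
    if word.endswith("sses") or word.endswith("xes"):
        return word[:-2]         # classes -> class, boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]         # windows -> window
    return word

print(strip_plural("windows"))  # window
print(strip_plural("bodies"))   # body
print(strip_plural("was"))      # was
```

Whatever rules we settle on, the same function must run over both the index and the query.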
The key is that, even if we are wrong, we need to do the same thing to the words we index and the words we use during querying.
Porter stemmer (used by Lucene): Porter, 1980, "An algorithm for suffix stripping," Program, Vol. 14, No. 3, pp. 130-137.
In a single text, there can be many different variations on a certain word. For example, Porter gives the following example:
To improve accuracy, it's a good idea to treat all of those words as the same. Stemming strips word endings using heuristics. Porter's algorithm has five phases and is very complicated. Manning provides the following sample translation:
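To show the flavor of suffix stripping without reproducing Porter's five phases, here is a toy single-pass stripper; the suffix list and the minimum-stem length are my own illustrative assumptions, not Porter's actual rules:

```python
# a toy stemmer in the spirit of suffix stripping -- NOT the real
# Porter algorithm, just one pass over a few common endings
SUFFIXES = ["ational", "ization", "fulness", "iveness", "ing", "ed", "ly", "s"]

def toy_stem(word):
    for suf in SUFFIXES:
        # require a reasonably long remaining stem, as Porter's
        # measure-based conditions do in a more principled way
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(toy_stem("connected"))    # connect
print(toy_stem("connecting"))   # connect
print(toy_stem("connections"))  # connection
```

Even this crude version collapses several of the connect/connected/connecting variations into one stem; the real algorithm handles far more endings, and libraries such as Lucene and NLTK ship tested implementations.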