We need a way to rank the list of documents that come back from a query. just giving unsorted list is probably not very useful because there might be many thousands of results. We want the most relevant results first, but what does that mean? Is that the best match according to the weights of the keywords? the list of documents with the highest page rank? a combination of both?
To match things according to relevance, we can simply score documents according to a variation of TFIDF (see Manning chapter 6) the we discussed that works well across documents. Or, we can use something called cosine similarity, discussed below.
As the Lucene documentation on Similarity.java describes, Lucene uses this approach:
One thing to keep in mind: The nature of your query problem and the nature of your data often dictates the solution. Further, this stuff is all still an art backed by science and mathematics. Each situation requires some analysis to figure out the best approach. There is no "one size fits all" solution to this problem.
TFIDF measures importance of a word in a doc relative to a corpus.
TFIDF uses local information from individual documents and global information from across documents. The formula for the importance or "weight" of word w usually is a variation of:
w_i = tf_i * log(1 / df_i)
where df_i = dcount_i / ndocs
where tf is term frequency, df is document frequency.
We discussed using Euclidean distance between two vectors X and Y for min-distance classification because proximity means similarity. The closer those vectors are, the more similar they are in feature space.
If we use word count instead of actual percentage (frequency), we can run into a problem when we are comparing the importance of words across documents instead of simply within a document (such as when looking at the keywords across Enron e-mails).
As Term Vector Theory and Keyword Weights points out,
We want the vectors to be normalized in some way so that proximity does in fact mean similarity. Imagine that we have two vectors that are very close to each other, but now I multiply one vector by 2. (One can imagine simply duplicating the contents of a document to get one exactly twice as long with exactly twice the word count.) Now, the vectors aren't as close. Increasing the length of the vector shouldn't really change the similarity measure. Visually:
What effect does multiplying a vector by 2 have in terms of what the vector represents? Multiplying a vector by scalar multiplies each element of the vector by that scalar. In this case, if we are using word count as our vector component units, we might have vectors like this for words oil and cash:
Multiply by two, we get a longer vector
2X = [ 4, 4 ]
So, multiplying by 2 simply says that there are twice as many references to words oil and cash. Notice, though, that it doesn't change their relative frequencies because we are multiplying each vector component by the same value. The ratio of count(oil)/count(cash) is still 1 even if we multiply the word counts by 2.
To prevent the length of the vector from throwing us off, we can simply normalize the vector lengths of X and Y. To normalize a vector, we have a couple of useful choices. Here are two:
This normalization makes the most important word have tf=1.0, which then make sense to compare uniformly across documents.
The whole point of this discussion of TFIDF and vector distance is that vector distance only makes sense for ranking document relevance if you've used normalized TF weight. The TF values have to be in the same units, in a sense, if we are going to compare the TFs for multiple documents.
To actually perform the merge sort of search results, we compute a score for each document and then order the results by that.
The situation is that we have a query with a fixed number of terms, say, n. Then we will get n lists back from our document database that we need to merge. For each document in each list, we compute the score. Then we do the merge with most highly scored documents shown first. The score is just the sum of the TFIDF with normalized TF values (see Manning eqn 6.9).
where w_i is the ith word in the query that has n terms. We compute this for every document. Given to lists of (score, term) tuples as returned from our get_word_freq() function in Python, we can use the sorted() function on the concatenated lists:
query = "cats dogs" = [cats, dogs]
Order is d1, d2, d3, which is decreasing score order.
not good for short queries like "best insurance"
Instead of summing up the TFIDF weights to determine importance of a document using a subset of the terms (from a query), we could also compute the distance of the query vector to the document vectors and sort by that. The idea is that relevant/similar documents will be at close to each other in n-term-space. Again, we have to have normalized the TF frequencies so that the n-space dimensions have the same units.
If vocab is 100 words long, each doc vector is 100 elements long. Fill these with tfidf scores. Missing words get score of 0.
Remember how we compute vector length, denoted |X|:
(Vector magnitude is also just the difference between X and the 0 vector.)
Unlike a Bayesian classifier, this minimum distance classification does not take into consideration the overall popularity of documents. In other words, we should give more weight to frequently requested documents that less frequently requested documents even if the search terms lean towards the less frequently requested.
Dramatic foreshadowing: visually, vector length might change depending on the word counts, but the angle stays the same!
What if we ignored vector length completely so we didn't have to worry about normalizing tf values? No problem, we can focus on the angle between the vectors instead to compare similarity. The idea is that similar vectors have similar angles with identical vectors laying on top of each other--their relative angle is zero, meaning no difference. If we cut and pasted document into itself to double its size, the length of the word count vector would double, say, from [2 9] to [4 18], but it's angle would not change.
All we have to do is figure out how to compute the angle between two vectors X and Y. Trigonometry to the rescue. First, let's assume we have to vectors on the unit circle (they are unit vectors),
then the cosine of the angle θ is:
where the dot product is:
We don't actually need the angle itself, θ, because the cosine of it is just as good and gives us a number in 0..1.
If the vectors are not on the unit circle, we can make them unit vectors (remember that the angle is what we care about--the magnitude is irrelevant so we can squish it down):
Substituting in the unit vectors we get:
When the cosine is 0, the documents have no difference--they are the same. If the cosine is 1, then they are orthogonal and considered to be completely different.
That is how you will find the cosine similarity measure expressed in textbooks and in the Lucene documentation.
Given the vector for each document returned from a Boolean query and the vector of the search query itself, we can compare the query vector against all those documents using the cosine similarity. We order by cos(θ).
Cosine similarity deals well with documents that don't have all the query terms, something we might want. For example, if we're looking for "how do I find the IP address of my computer", not all of the terms will be present in each document that is relevant.
After we have indexed a corpus, we have a complete set of the terms from all documents--the index. The vector space model simply says that those terms define a giant vector in hyperspace of length "size of index". Each term is a dimension. As we described above and as Lucene does, we use TFIDF as the weights in each dimension. If the document does not have a term, its weight is zero. This deals naturally with queries that don't contain all terms in the vocabulary. Documents can still have high scores even without having all the terms if they happen to have big weights for some important terms.
To compare documents and a query document, we compute the cosine similarity for the document vectors against the query vector and sort by that. It's important to note that the vector for each document is ordered by the indexed so that we are using the same dimensions and the same order. In other words, we sort the words in the document according to the index order not the order in which they appear in the document.
compute unit vector for query terms and put in vec with len = M = |vocab|. Then compute dot product with it and all d_i. Order by decreasing dot product (cos similarity). cos theta = 1 is perfect match. angle is 0.
Some result docs will pop up high that don't have all terms.
See Manning example 6.4, Fig 6.14 has the basic algorithm for computing vector space scores. If the word frequency in the query is 1, such as "best insurance", then the algorithm reduces to computing the scores shown above. All we do is sum up the weights (tfidf) for each query term in a particular document. That is its score. Notice that this means when the query vector has binary elements, cosine similarity reduces to simple summing of scores. This make sense for short queries like "best insurance", but not for full document comparisons. See Fig 7.1 for the updated algorithm.
We get away with this, even though the query vector elements being zero and one do not lead to a unit vector because Manning says:"
...it sufﬁces to compute the cosine similarity from each document unit vector v(d) to V(q) (in which all non-zero components of the query vector are set to 1), rather than to the unit vector v(q)."
Manning further says "For any document d, the cosine similarity V(q) dotprod v(d) is the weighted sum, over all terms in the query q, of the weights of those terms in d"
So presumably, we find the union of all postings lists for term queries and then run the algorithm to compute the scores for each document vector (using only terms from the query).
From there, we need top K queries. That means creating a heap in in 2J time for J documents with nonzero scores. reading the K queries takes KlogJ time.
sometimes we have different zones title vs document content. A query that matches the exact title or close to it should be weighted more heavily than content that matches those terms.
order matters and these mechanisms do "bag of words" type similarity comparisons. Once we get the weights for each word, we can use Spearman's rank (http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient) to consider the order of the words. It looks at the words of the document in sequence whereas we have sorted the document terms by the order of the index in the vector space model.
When looking for the top K documents, it's too slow to sort so we build and keep in O(2N) for N total documents and then we can get the top score in O(logN) and same for K top scores.