
TF-IDF

Goal

Answer the question: how important is this word for differentiating between documents? We want a numeric score that we can use to compare terms.

Manning et al. say:

In other words, tf-idf assigns to term t a weight in document d that is
  1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.
We start out with term frequency, tf = |ref(t,d)| / |d|, where |ref(t,d)| is the number of references to t in d and |d| is the size of the document in words. The larger the fraction of the document taken up by a term, the higher its score; for example, if president appears 10 times in a 1,000-word article, tf = 10/1000 = 0.01. That gives a high score to words that are mentioned a lot, such as president in a political document. Unfortunately, it also gives a high score to words like the.

The solution is to reduce a term's tf score according to how often the term is used across documents. We can try the following

tf / |doc(t)|

where doc(t) is the set of documents containing t. Unfortunately, this doesn't work very well: the score is not attenuated enough by the denominator. What we really need is a nonlinear weighting that drives the score way down as the document count gets close to "every document" (df goes to 1). If we use the document frequency df = |doc(t)| / ndocs rather than the raw document count, we can multiply tf by log(1/df), which is the same as log(ndocs / |doc(t)|):

Term t = index.get(w);                         // look up the term record for word w
Posting p = t.postings.get(docID);             // posting for this term in document docID
int docSize = docs.get(docID).nterms;          // |d|: number of words in the document
double tf = p.countInDoc / (double)docSize;    // term frequency within this document
double df = (double)t.ndocs / ndocs;           // fraction of documents containing t
double score = tf * Math.log(1 / df);          // tf-idf score for term t in document docID

Document score


Score(q, d) = ∑_{t ∈ q} tfidf(t, d)
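
As a rough sketch of that sum in code, assuming the index, docs, Term, and Posting structures from the snippet above and a query given as a list of words (query is an assumption here; terms missing from the index or from the document simply contribute nothing):

double score = 0.0;
for (String w : query) {                       // query: List<String> of query terms
    Term t = index.get(w);
    if ( t==null ) continue;                   // term not in the corpus
    Posting p = t.postings.get(docID);
    if ( p==null ) continue;                   // term not in this document
    int docSize = docs.get(docID).nterms;
    double tf = p.countInDoc / (double)docSize;
    double df = (double)t.ndocs / ndocs;
    score += tf * Math.log(1 / df);            // accumulate tf-idf over the query terms
}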

Corpus vocabulary

Goal: how can we figure out the most important terms in a collection of documents? In other words, if we have a group of political documents, what is the vocabulary?

If we talk about a term a lot, it should get a high score. If that word is used a lot across documents, it should get a lower score. As df approaches unity, the term should get an exponentially lower score. I found in my experiments that log(1/df)^2 does a much better job of pushing 'the', 'that', and so on downwards.

double tf = t.countInCorpus / (double)nwords;  // term frequency across the whole corpus
double df = (double)t.ndocs / ndocs;           // fraction of documents containing t
// double score = tf * Math.log(1 / df);       // doesn't push the score down far enough
double score = tf * Math.log(1 / df) * Math.log(1 / df);  // tf * log(1/df)^2
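
A minimal sketch of how one might rank the whole vocabulary with that score and print the top terms, assuming index maps each word to a Term record with the countInCorpus and ndocs fields used above:

Map<String, Double> scores = new HashMap<>();
for (Map.Entry<String, Term> e : index.entrySet()) {
    Term t = e.getValue();
    double tf = t.countInCorpus / (double)nwords;
    double df = (double)t.ndocs / ndocs;
    scores.put(e.getKey(), tf * Math.log(1 / df) * Math.log(1 / df));
}
scores.entrySet().stream()                     // highest-scoring terms first
      .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
      .limit(25)
      .forEach(e -> System.out.printf("%.5f %s%n", e.getValue(), e.getKey()));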

Some terms from defense articles from Reuters:

0.01025 nato           , tc=31, tf=0.0010, df=0.043, doccount=4, ndocs 94, 1/df=23.5000, log(1/df)=3.1570
0.00985 taiwan         , tc=44, tf=0.0015, df=0.074, doccount=7, ndocs 94, 1/df=13.4286, log(1/df)=2.5974
0.00926 japan          , tc=46, tf=0.0015, df=0.085, doccount=8, ndocs 94, 1/df=11.7500, log(1/df)=2.4639
0.00895 iraq           , tc=40, tf=0.0013, df=0.074, doccount=7, ndocs 94, 1/df=13.4286, log(1/df)=2.5974
0.00866 ukraine        , tc=22, tf=0.0007, df=0.032, doccount=3, ndocs 94, 1/df=31.3333, log(1/df)=3.4447
0.00833 treaty         , tc=204, tf=0.0068, df=0.330, doccount=31, ndocs 94, 1/df=3.0323, log(1/df)=1.1093
0.00828 gujral         , tc=29, tf=0.0010, df=0.053, doccount=5, ndocs 94, 1/df=18.8000, log(1/df)=2.9339
0.00809 india          , tc=187, tf=0.0062, df=0.319, doccount=30, ndocs 94, 1/df=3.1333, log(1/df)=1.1421
0.00787 bnd            , tc=16, tf=0.0005, df=0.021, doccount=2, ndocs 94, 1/df=47.0000, log(1/df)=3.8501
0.00766 chemical       , tc=46, tf=0.0015, df=0.106, doccount=10, ndocs 94, 1/df=9.4000, log(1/df)=2.2407
0.00745 sale           , tc=37, tf=0.0012, df=0.085, doccount=8, ndocs 94, 1/df=11.7500, log(1/df)=2.4639
0.00712 indonesia      , tc=39, tf=0.0013, df=0.096, doccount=9, ndocs 94, 1/df=10.4444, log(1/df)=2.3461
0.00689 lien           , tc=14, tf=0.0005, df=0.021, doccount=2, ndocs 94, 1/df=47.0000, log(1/df)=3.8501
0.00689 arrow          , tc=14, tf=0.0005, df=0.021, doccount=2, ndocs 94, 1/df=47.0000, log(1/df)=3.8501
0.00685 buyoya         , tc=10, tf=0.0003, df=0.011, doccount=1, ndocs 94, 1/df=94.0000, log(1/df)=4.5433
0.00675 australia      , tc=48, tf=0.0016, df=0.128, doccount=12, ndocs 94, 1/df=7.8333, log(1/df)=2.0584
0.00644 british        , tc=32, tf=0.0011, df=0.085, doccount=8, ndocs 94, 1/df=11.7500, log(1/df)=2.4639
0.00639 qantas         , tc=13, tf=0.0004, df=0.021, doccount=2, ndocs 94, 1/df=47.0000, log(1/df)=3.8501
0.00616 rifkind        , tc=9, tf=0.0003, df=0.011, doccount=1, ndocs 94, 1/df=94.0000, log(1/df)=4.5433
0.00616 bp             , tc=9, tf=0.0003, df=0.011, doccount=1, ndocs 94, 1/df=94.0000, log(1/df)=4.5433
0.00592 missile        , tc=61, tf=0.0020, df=0.181, doccount=17, ndocs 94, 1/df=5.5294, log(1/df)=1.7101
0.00590 libya          , tc=12, tf=0.0004, df=0.021, doccount=2, ndocs 94, 1/df=47.0000, log(1/df)=3.8501
0.00582 nuclear        , tc=287, tf=0.0095, df=0.457, doccount=43, ndocs 94, 1/df=2.1860, log(1/df)=0.7821
0.00568 text           , tc=67, tf=0.0022, df=0.202, doccount=19, ndocs 94, 1/df=4.9474, log(1/df)=1.5989
0.00562 exercise       , tc=17, tf=0.0006, df=0.043, doccount=4, ndocs 94, 1/df=23.5000, log(1/df)=3.1570

And here are some terms from political articles:

0.01047 lebed          , tc=334, tf=0.0017, df=0.083, doccount=47, ndocs 564, 1/df=12.0000, log(1/df)=2.4849
0.00943 yeltsin        , tc=381, tf=0.0019, df=0.110, doccount=62, ndocs 564, 1/df=9.0968, log(1/df)=2.2079
0.00942 dole           , tc=275, tf=0.0014, df=0.074, doccount=42, ndocs 564, 1/df=13.4286, log(1/df)=2.5974
0.00786 clinton        , tc=406, tf=0.0021, df=0.142, doccount=80, ndocs 564, 1/df=7.0500, log(1/df)=1.9530
0.00765 israel         , tc=199, tf=0.0010, df=0.064, doccount=36, ndocs 564, 1/df=15.6667, log(1/df)=2.7515
0.00694 palestinian    , tc=148, tf=0.0008, df=0.048, doccount=27, ndocs 564, 1/df=20.8889, log(1/df)=3.0392
0.00691 chechnya       , tc=228, tf=0.0012, df=0.087, doccount=49, ndocs 564, 1/df=11.5102, log(1/df)=2.4432
0.00671 bosnia         , tc=122, tf=0.0006, df=0.037, doccount=21, ndocs 564, 1/df=26.8571, log(1/df)=3.2905
0.00661 osce           , tc=84, tf=0.0004, df=0.020, doccount=11, ndocs 564, 1/df=51.2727, log(1/df)=3.9372
0.00628 arafat         , tc=111, tf=0.0006, df=0.035, doccount=20, ndocs 564, 1/df=28.2000, log(1/df)=3.3393
0.00628 anc            , tc=111, tf=0.0006, df=0.035, doccount=20, ndocs 564, 1/df=28.2000, log(1/df)=3.3393
0.00623 apartheid      , tc=110, tf=0.0006, df=0.035, doccount=20, ndocs 564, 1/df=28.2000, log(1/df)=3.3393
0.00615 russian        , tc=263, tf=0.0013, df=0.117, doccount=66, ndocs 564, 1/df=8.5455, log(1/df)=2.1454
0.00613 tax            , tc=255, tf=0.0013, df=0.113, doccount=64, ndocs 564, 1/df=8.8125, log(1/df)=2.1762
0.00607 burundi        , tc=66, tf=0.0003, df=0.014, doccount=8, ndocs 564, 1/df=70.5000, log(1/df)=4.2556
0.00552 refugees       , tc=112, tf=0.0006, df=0.044, doccount=25, ndocs 564, 1/df=22.5600, log(1/df)=3.1162
0.00552 netanyahu      , tc=109, tf=0.0006, df=0.043, doccount=24, ndocs 564, 1/df=23.5000, log(1/df)=3.1570
0.00547 budget         , tc=403, tf=0.0020, df=0.195, doccount=110, ndocs 564, 1/df=5.1273, log(1/df)=1.6346
0.00545 grozny         , tc=171, tf=0.0009, df=0.082, doccount=46, ndocs 564, 1/df=12.2609, log(1/df)=2.5064
0.00535 bossi          , tc=43, tf=0.0002, df=0.007, doccount=4, ndocs 564, 1/df=141.0000, log(1/df)=4.9488
0.00528 tobacco        , tc=64, tf=0.0003, df=0.018, doccount=10, ndocs 564, 1/df=56.4000, log(1/df)=4.0325
0.00518 klerk          , tc=86, tf=0.0004, df=0.032, doccount=18, ndocs 564, 1/df=31.3333, log(1/df)=3.4447
0.00512 peres          , tc=71, tf=0.0004, df=0.023, doccount=13, ndocs 564, 1/df=43.3846, log(1/df)=3.7701
0.00506 israeli        , tc=108, tf=0.0005, df=0.048, doccount=27, ndocs 564, 1/df=20.8889, log(1/df)=3.0392

The importance of log(IDF)


tf * (1/df) works OK, but tf * log(1 / df) works better. Dividing the term frequency by the document frequency definitely penalizes a term's score for being more frequent across documents. However, as the document frequency gets lower and lower towards 0%, the score should go higher and higher, and it should collapse to nothing as df approaches 100%. That's what log(1/df) does for us. A df of 100% = 1 gives log(1/1) = log(1) = 0, which essentially wipes out the score. A df of 1% = 0.01 gives log(1/0.01) = log(100) ≈ 4.6. We use df rather than the raw document count so that df lands in the range (0..1], which keeps 1/df ≥ 1 and log(1/df) ≥ 0.
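
To see the difference numerically, here is a tiny sketch (the df values are made up purely for illustration) that prints tf * (1/df) next to tf * log(1/df) for a fixed tf:

double tf = 0.01;                              // fixed term frequency for illustration
double[] dfs = {1.0, 0.5, 0.1, 0.01, 0.001};   // made-up document frequencies
for (double df : dfs) {
    System.out.printf("df=%.3f  tf*(1/df)=%8.4f  tf*log(1/df)=%7.4f%n",
                      df, tf * (1 / df), tf * Math.log(1 / df));
}

With df = 1 the raw weighting leaves the score at tf while the log version zeroes it out; at very small df the raw version explodes while the log version grows gently.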


Samples


ndocs = 2000
/Users/parrt/courses/CS680/data/reuters-vol1-disk1-subset/100000newsML.xml.txt
0.107 cotton         , tc=3, dc=12
0.100 rain           , tc=3, dc=17
0.097 carolinas      , tc=2, dc=2
0.090 storm          , tc=3, dc=27
0.079 inches         , tc=2, dc=7
0.073 georgia        , tc=2, dc=11
0.061 josephine      , tc=2, dc=25
0.061 tropical       , tc=2, dc=26
0.059 cent           , tc=2, dc=30
0.055 florida        , tc=2, dc=38
0.053 deluge         , tc=1, dc=1
0.053 players        , tc=2, dc=46
0.051 released       , tc=2, dc=52
0.048 deferreds      , tc=1, dc=2
0.045 jon            , tc=1, dc=3
0.043 suzanne        , tc=1, dc=4
0.042 meteorologist  , tc=1, dc=5
0.040 davis          , tc=1, dc=7
0.039 buoyed         , tc=1, dc=8
0.036 poised         , tc=1, dc=12
0.035 usda           , tc=1, dc=13
/Users/parrt/courses/CS680/data/reuters-vol1-disk1-subset/100001newsML.xml.txt
0.196 ballots        , tc=7, dc=2
0.123 johnston       , tc=4, dc=1
0.079 court          , tc=6, dc=76
0.062 galvin         , tc=2, dc=1
0.056 blank          , tc=2, dc=2
0.050 massachusetts  , tc=2, dc=4
0.047 congressional  , tc=2, dc=6
0.040 state          , tc=5, dc=280
0.038 votes          , tc=2, dc=19
0.037 name           , tc=3, dc=93
0.037 ruled          , tc=2, dc=21
0.035 william        , tc=2, dc=25
0.035 supreme        , tc=2, dc=28
0.033 democratic     , tc=2, dc=32
0.032 highest        , tc=2, dc=38
0.031 hillary        , tc=1, dc=1
0.031 review         , tc=2, dc=46
0.028 voter          , tc=1, dc=2
0.027 she            , tc=2, dc=68
0.026 nominee        , tc=1, dc=3
0.026 believe        , tc=2, dc=82
/Users/parrt/courses/CS680/data/reuters-vol1-disk1-subset/100002newsML.xml.txt
0.112 citrus         , tc=3, dc=1
0.085 storm          , tc=4, dc=27
0.068 boxes          , tc=2, dc=2
0.061 rangebound     , tc=2, dc=4
0.059 meteorologist  , tc=2, dc=5
0.053 november       , tc=4, dc=138
0.043 damage         , tc=2, dc=25
0.043 tropical       , tc=2, dc=26
0.039 florida        , tc=2, dc=38
0.037 juice          , tc=1, dc=1
0.037 estimate       , tc=2, dc=46
0.034 tree           , tc=1, dc=2
0.032 jon            , tc=1, dc=3
0.031 futures        , tc=2, dc=82
0.031 suzanne        , tc=1, dc=4
0.030 hit            , tc=2, dc=100
0.029 drift          , tc=1, dc=6
0.028 concentrated   , tc=1, dc=7
0.027 orange         , tc=1, dc=9
0.026 fruit          , tc=1, dc=10
0.025 ranged         , tc=1, dc=12
