Notes on Building a Search Index

Goal

Let's say we want to build a search engine, whether it is Boolean-based or free-form text like Google. The first step is to find all documents that satisfy the query. Then we need to order the results so the most relevant ones appear at the top. There are two basic ways to do that: the first is to analyze the contents of the documents themselves; the second is to rank by the relative importance of the documents (assuming the documents are linked in some way, as web pages or email logs are).

To find all documents that satisfy a query, we need a dictionary or index with functions that map:

  • posting(term, document) -> number of references to term in document
  • postings(term) -> list of documents that contain the term
  • doccount(term) -> number of documents that contain the term.
  • termcount(term) -> number of times term is referenced in all documents
Not that hard to do, but hard to make efficient in both time and memory.
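As a sketch of those four functions, assuming a tiny in-memory index where each term maps to a docID -> count map (the layout and names here are illustrative, not prescribed):

```java
import java.util.*;

// Sketch of the four lookup functions over a tiny in-memory index.
// The nested map goes term -> (docID -> reference count).
public class IndexApi {
    static Map<String, Map<Integer, Integer>> index = new HashMap<>();

    static void add(String term, int docID) {
        index.computeIfAbsent(term, t -> new TreeMap<>())
             .merge(docID, 1, Integer::sum);
    }

    // posting(term, document) -> number of references to term in document
    static int posting(String term, int docID) {
        return index.getOrDefault(term, Map.of()).getOrDefault(docID, 0);
    }

    // postings(term) -> list of documents that contain the term
    static List<Integer> postings(String term) {
        return new ArrayList<>(index.getOrDefault(term, Map.of()).keySet());
    }

    // doccount(term) -> number of documents that contain the term
    static int doccount(String term) {
        return index.getOrDefault(term, Map.of()).size();
    }

    // termcount(term) -> number of times term is referenced in all documents
    static int termcount(String term) {
        int n = 0;
        for (int c : index.getOrDefault(term, Map.of()).values()) n += c;
        return n;
    }

    public static void main(String[] args) {
        add("jobs", 1); add("jobs", 1); add("jobs", 2);
        System.out.println(postings("jobs"));   // [1, 2]
        System.out.println(doccount("jobs"));   // 2
        System.out.println(termcount("jobs"));  // 3
    }
}
```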

Process

Manning, chap. 1:

  1. Collect docs; assign each document a unique ID (docID) as we read it
  2. Tokenize
  3. Linguistic processing (stemming, ...), normalization
  4. Index docs: dictionary + postings
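A minimal end-to-end sketch of those four steps. The "linguistic processing" here is just lowercasing plus a crude letters-only tokenizer (numbers stripped, no stemming), an assumption chosen to match the word counts in the tables below:

```java
import java.util.*;

// Sketch of steps 1-4: assign docIDs, tokenize, normalize (lowercase only,
// no stemming), and build the dictionary + postings.
public class BuildIndex {
    // term -> sorted map of docID -> term frequency
    static Map<String, TreeMap<Integer, Integer>> buildIndex(List<String> docs) {
        Map<String, TreeMap<Integer, Integer>> index = new TreeMap<>();
        int docID = 0;
        for (String doc : docs) {
            docID++;                              // step 1: assign docID
            for (String tok : doc.toLowerCase()   // step 3: normalize
                     .split("[^a-z]+")) {         // step 2: tokenize (crude)
                if (tok.isEmpty()) continue;
                index.computeIfAbsent(tok, t -> new TreeMap<>())
                     .merge(docID, 1, Integer::sum);  // step 4: dict + postings
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, TreeMap<Integer, Integer>> index = buildIndex(List.of(
            "10 years ago we had Steve Jobs, Bob Hope and Johnny Cash - " +
            "Now we have no jobs, no hope and no cash.",
            "Dear Blackberry, Thanks for honoring Steve Jobs' death with " +
            "silence for 3 continuous days."));
        System.out.println(index.get("jobs"));   // {1=2, 2=1}
        System.out.println(index.get("no"));     // {1=3}
    }
}
```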

Example

Doc1
10 years ago we had Steve Jobs, Bob Hope and Johnny Cash - Now we have no jobs, no hope and no cash.

term        word-count
ago         1
and         2
bob         1
cash        2
had         1
have        1
hope        2
jobs        2
johnny      1
no          3
now         1
steve       1
we          2
years       1

Doc2

Dear Blackberry, Thanks for honoring Steve Jobs' death with silence for 3 continuous days.

term        word-count
blackberry  1
continuous  1
days        1
dear        1
death       1
for         2
honoring    1
jobs        1
silence     1
steve       1
thanks      1
with        1

ndocs = 2, nchars = 145, nuniqchars = 112
nwords = 34, unique words = 24

Dictionary:

term        doc-count
ago         1
and         1
blackberry  1
bob         1
cash        1
continuous  1
days        1
dear        1
death       1
for         1
had         1
have        1
honoring    1
hope        1
jobs        2
johnny      1
no          1
now         1
silence     1
steve       2
thanks      1
we          1
with        1
years       1

Index:

term        postings
ago         1
and         1
blackberry  2
bob         1
cash        1
continuous  2
days        2
dear        2
death       2
for         2
had         1
have        1
honoring    2
hope        1
jobs        1, 2
johnny      1
no          1
now         1
silence     2
steve       1, 2
thanks      2
we          1
with        2
years       1

Term vectors

For Boolean queries.

doc1

term        word-freq
ago         1
and         2
blackberry  0
bob         1
cash        2
continuous  0
days        0
dear        0
death       0
for         0
had         1
have        1
honoring    0
hope        2
jobs        2
johnny      1
no          3
now         1
silence     0
steve       1
thanks      0
we          2
with        0
years       1

doc2

term        word-freq
ago         0
and         0
blackberry  1
bob         0
cash        0
continuous  1
days        1
dear        1
death       1
for         2
had         0
have        0
honoring    1
hope        0
jobs        1
johnny      0
no          0
now         0
silence     1
steve       1
thanks      1
we          0
with        1
years       0

binary encoding of presence or absence of term in doc

            doc 1  doc 2
ago           1      0
and           1      0
blackberry    0      1
bob           1      0
...          ...    ...

Do we sort the elements in the postings list? Yes, so that we can do an efficient merge for Boolean queries: intersecting two ordered lists is much faster than intersecting unordered lists.
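That ordered merge can be sketched as a linear-time intersection of two sorted docID lists: advance whichever pointer holds the smaller docID.

```java
import java.util.*;

// Linear-time intersection of two sorted postings lists: advance whichever
// side currently holds the smaller docID; emit matches.
public class PostingsMerge {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }
            else if (a[i] < b[j]) i++;
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        // postings for "steve" and "jobs" from the example: both in docs 1 and 2
        System.out.println(intersect(new int[]{1, 2}, new int[]{1, 2}));  // [1, 2]
        // "no" appears only in doc 1
        System.out.println(intersect(new int[]{1}, new int[]{1, 2}));     // [1]
    }
}
```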

Do we store postings as arrays or linked lists? What are the trade-offs?

How do you add to a sorted list?

Where should we store the document frequency?

Where should we store the term frequencies?

ago             ndocs=1 nrefs=1 -> (doc=1,nrefs=1)
and             ndocs=1 nrefs=2 -> (1,2)
blackberry      ndocs=1 nrefs=1 -> (2,1)
bob             ndocs=1 nrefs=1 -> (1,1)
cash            ndocs=1 nrefs=2 -> (1,2)
continuous      ndocs=1 nrefs=1 -> (2,1)
days            ndocs=1 nrefs=1 -> (2,1)
dear            ndocs=1 nrefs=1 -> (2,1)
death           ndocs=1 nrefs=1 -> (2,1)
for             ndocs=1 nrefs=2 -> (2,2)
had             ndocs=1 nrefs=1 -> (1,1)
have            ndocs=1 nrefs=1 -> (1,1)
honoring        ndocs=1 nrefs=1 -> (2,1)
hope            ndocs=1 nrefs=2 -> (1,2)
jobs            ndocs=2 nrefs=3 -> (1,2) (2,1)
johnny          ndocs=1 nrefs=1 -> (1,1)
no              ndocs=1 nrefs=3 -> (1,3)
now             ndocs=1 nrefs=1 -> (1,1)
silence         ndocs=1 nrefs=1 -> (2,1)
steve           ndocs=2 nrefs=2 -> (1,1) (2,1)
thanks          ndocs=1 nrefs=1 -> (2,1)
we              ndocs=1 nrefs=2 -> (1,2)
with            ndocs=1 nrefs=1 -> (2,1)
years           ndocs=1 nrefs=1 -> (1,1)
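One way to answer the two storage questions, matching the listing above: keep ndocs (the document frequency) in the dictionary entry, and keep nrefs per document inside each posting. A minimal sketch with illustrative names:

```java
import java.util.*;

// One layout answering the questions above: document frequency lives in the
// dictionary entry; per-document term frequency lives inside each posting.
public class DictEntry {
    record Posting(int docID, int nrefs) {}

    static class Term {
        final String term;
        final List<Posting> postings = new ArrayList<>(); // kept sorted by docID
        Term(String term) { this.term = term; }
        int ndocs() { return postings.size(); }           // document frequency
        int nrefs() {                                     // total references
            return postings.stream().mapToInt(Posting::nrefs).sum();
        }
    }

    public static void main(String[] args) {
        Term jobs = new Term("jobs");
        jobs.postings.add(new Posting(1, 2));
        jobs.postings.add(new Posting(2, 1));
        // matches the listing: jobs  ndocs=2 nrefs=3 -> (1,2) (2,1)
        System.out.println(jobs.term + " ndocs=" + jobs.ndocs()
                           + " nrefs=" + jobs.nrefs());
    }
}
```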

Dictionary data structure

Do we use a hash table, a B-tree, or a sorted list?

With a sorted list, we still need a pointer to the term string and a pointer to its postings. Insertion can be expensive because we have to shift elements down, as in insertion sort. It uses many fewer pointers than a binary tree, however.

Hash tables must be rehashed as the table grows, which is expensive. Hash tables also don't help us with prefixes or wildcard searches.

Binary search trees: the left child of a node is lexicographically less than the node and the right child is greater; the children are themselves binary search trees. These can become unbalanced easily, blowing the O(log N) lookup guarantee.

B-trees work well. They have multiple elements per node, and the child nodes sit between the elements of the current node. They stay balanced, and because they combine multiple levels into one node, they are easier to store on disk if we want to avoid an in-memory index.

Trees require an ordering of letters; Chinese and Japanese now have a standard order defined.

Binary trees can be stored efficiently without pointers in an array: node i's children live at 2*i and 2*i+1.
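A small sketch of that layout, using 1-based indexing so the arithmetic stays clean (index 0 is left unused):

```java
// A complete binary tree stored in an array with no child pointers: with
// 1-based indexing, node i's children sit at 2*i and 2*i + 1, parent at i/2.
public class ImplicitTree {
    static int left(int i)   { return 2 * i; }
    static int right(int i)  { return 2 * i + 1; }
    static int parent(int i) { return i / 2; }

    public static void main(String[] args) {
        // index:      0     1    2    3    4    5    6    7
        String[] t = {null, "m", "f", "t", "c", "j", "p", "x"};
        System.out.println(t[left(2)] + " " + t[right(2)]);  // children of "f": c j
        System.out.println(t[parent(5)]);                    // parent of "j": m
    }
}
```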

Wildcard queries

What if we kept all terms at the leaves? Manning's figure 3.1 shows a binary search tree whose edges split the children into ranges: from the root, the left covers a-m and the right covers n-z. At the second level, the a-m node splits into a-hu and hy-m.

What about a DFA / trie? It relies on the fact that many words start with the same prefix. Heck, even stripping out just the first character a-z is a big win: on 500,000 terms we save about 500,000 - 26 characters. Of course, we need more pointers because we're using a pointer per character.


A DFA is like a binary search tree on its side. [Draw a trie: the root points at a-m and n-z nodes, with paths down to money, monday, monster, ...] A trailing-wildcard query like mon* is easy: walk down one node per character until the *, then enumerate all the leaves below that node.

We can handle a leading * by building a reverse B-tree or trie: lemon would be stored as root-n-o-m-e-l.

We can handle a general * by using both kinds of trees.
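A minimal trie sketch for trailing-* queries like mon*; the node layout here is illustrative, not taken from Manning:

```java
import java.util.*;

// A minimal trie: walk down one node per character of the prefix, then
// enumerate every word below that node. Handles trailing-* queries.
public class WildcardTrie {
    static class Node {
        Map<Character, Node> kids = new TreeMap<>();
        boolean isWord;
    }
    static Node root = new Node();

    static void insert(String word) {
        Node n = root;
        for (char c : word.toCharArray())
            n = n.kids.computeIfAbsent(c, k -> new Node());
        n.isWord = true;
    }

    // collect all words under the node reached by walking the prefix
    static List<String> prefixQuery(String prefix) {
        Node n = root;
        for (char c : prefix.toCharArray()) {
            n = n.kids.get(c);
            if (n == null) return List.of();
        }
        List<String> out = new ArrayList<>();
        collect(n, new StringBuilder(prefix), out);
        return out;
    }

    static void collect(Node n, StringBuilder sb, List<String> out) {
        if (n.isWord) out.add(sb.toString());
        for (Map.Entry<Character, Node> e : n.kids.entrySet()) {
            sb.append(e.getKey());
            collect(e.getValue(), sb, out);
            sb.setLength(sb.length() - 1);
        }
    }

    public static void main(String[] args) {
        for (String w : new String[]{"monday", "money", "monster", "moon"}) insert(w);
        System.out.println(prefixQuery("mon")); // [monday, money, monster]
    }
}
```

A leading-* query would use the same structure built over reversed words, as described above.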

Spelling correction

  1. Given multiple correctly-spelled candidates for a misspelled query, choose the closest one, by some definition of closest; edit distance and k-grams are common.
  2. If two correctly-spelled words are tied in closeness, choose the more common word: for grnt, choose grant, not grunt. "Common" can be measured over the documents or over the queries users type.
Edit distance is the number of edit operations required to transform one string into the other. The edit operations:
  1. insert a character
  2. delete a character
  3. replace a character
The distance between cat and dog is 3.
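The standard dynamic-programming computation of that distance, as a sketch:

```java
// Classic dynamic-programming edit distance (Levenshtein): insert, delete,
// and replace each cost 1; dp[i][j] is the distance between the prefixes
// a[0..i) and b[0..j).
public class EditDistance {
    static int distance(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) dp[i][0] = i;  // all deletions
        for (int j = 0; j <= b.length(); j++) dp[0][j] = j;  // all insertions
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++) {
                int sub = dp[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                dp[i][j] = Math.min(sub,
                           Math.min(dp[i - 1][j] + 1, dp[i][j - 1] + 1));
            }
        return dp[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("cat", "dog"));     // 3
        System.out.println(distance("grnt", "grant"));  // 1
        System.out.println(distance("grnt", "grunt"));  // 1: tied, so prefer the more common word
    }
}
```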

Compressing postings lists

The gaps between consecutive document IDs in a postings list should be pretty small, so keep the delta rather than the full three- or four-byte document ID: store the first docID and then the deltas. This could drop 500,000 four-byte integers (one per document) to 1-2 bytes each, saving over 1 MB of memory. We really need variable-length encoding, however, because some gaps can be very large, while gaps for stopwords will be 1 or 2. One way is to use the 8th (high) bit of each byte as a continuation flag: it is 0 while more bytes follow and 1 on the last byte of each encoded number. Manning reports about a 50% reduction in postings list size. Sample postings lists:

triple          ndocs=2 nrefs=2 -> (300,1) (996,1) 

sheikh          ndocs=3 nrefs=6 -> (350,1) (391,1) (798,4) 

metra           ndocs=1 nrefs=1 -> (603,1) 

equitable       ndocs=2 nrefs=2 -> (207,1) (876,1) 

brent           ndocs=6 nrefs=17 -> (431,3) (630,3) (652,1) (708,5) (716,3) (743,2) 

lagrange        ndocs=1 nrefs=1 -> (87,1) 

benazir         ndocs=6 nrefs=14 -> (301,5) (302,5) (316,1) (332,1) (353,1) (386,1) 

expanding       ndocs=7 nrefs=8 -> (18,2) (115,1) (173,1) (213,1) (777,1) (778,1) (804,1) 

immigration     ndocs=4 nrefs=5 -> (162,1) (287,1) (350,2) (791,1) 

factions        ndocs=3 nrefs=3 -> (336,1) (434,1) (435,1) 

telmex          ndocs=1 nrefs=1 -> (443,1) 

birdsell        ndocs=1 nrefs=1 -> (158,1) 

researches      ndocs=1 nrefs=1 -> (552,1) 

into            ndocs=167 nrefs=228 -> (18,3) (28,1) (29,2) (30,1) (33,2) (38,2) (44,1) (49,4) (76,1) (81,1) (88,1) (104,1) (116,1) (124,2) (152,1) (154,1) (158,1) (169,1) (171,1) (175,1) (178,1) (181,2) (188,1) (193,1) (195,2) (196,3) (197,1) (205,1) (207,1) (208,1) (209,1) (217,2) (218,1) (221,4) (225,1) (226,2) (229,1) (235,2) (236,1) (237,1) (238,1) (250,1) (251,1) (256,1) (265,1) (268,1) (269,1) (272,1) (274,1) (277,1) (278,1) (281,1) (284,1) (287,2) (294,1) (303,3) (309,1) (311,2) (317,2) (319,2) (326,2) (327,1) (336,1) (339,1) (340,1) (342,2) (350,2) (356,1) (408,5) (409,2) (411,1) (413,1) (414,1) (420,1) (421,1) (427,1) (431,1) (433,2) (453,1) (484,1) (498,1) (502,1) (516,1) (522,2) (541,2) (546,2) (553,1) (563,4) (565,1) (576,1) (577,1) (580,1) (582,1) (589,1) (599,1) (607,3) (610,2) (619,1) (620,1) (639,2) (643,1) (647,1) (652,1) (653,1) (663,1) (673,1) (680,1) (692,1) (695,1) (696,1) (703,1) (708,1) (734,1) (736,1) (741,1) (745,1) (751,1) (757,1) (758,2) (761,2) (762,1) (765,3) (778,2) (786,1) (804,2) (807,1) (808,1) (817,1) (821,1) (822,2) (823,1) (827,2) (828,2) (831,1) (834,2) (838,1) (841,1) (844,1) (845,1) (846,1) (853,1) (859,1) (860,2) (870,1) (871,1) (876,1) (885,1) (893,1) (898,1) (901,1) (907,2) (909,4) (921,1) (926,1) (930,1) (931,1) (932,1) (939,1) (942,1) (944,3) (948,1) (950,1) (953,1) (963,1) (970,1) (977,1) (987,1) 
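The delta-plus-flag-bit scheme described above can be sketched as follows. The convention of setting the high bit on the last byte of each number is an assumption consistent with Manning's variable byte code, not a quote of it:

```java
import java.io.ByteArrayOutputStream;
import java.util.*;

// Variable-byte encoding of docID gaps: 7 payload bits per byte, with the
// high bit as the continuation flag (set only on the last byte of a number).
public class VByte {
    // encode one gap: 7 bits per byte, most significant group first
    static void encodeNumber(int n, ByteArrayOutputStream out) {
        Deque<Integer> groups = new ArrayDeque<>();
        while (true) {
            groups.addFirst(n % 128);
            if (n < 128) break;
            n /= 128;
        }
        int last = groups.removeLast();
        for (int g : groups) out.write(g);  // flag bit 0: more bytes follow
        out.write(last | 0x80);             // flag bit 1: number is complete
    }

    static byte[] encodeGaps(int[] docIDs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIDs) { encodeNumber(id - prev, out); prev = id; }
        return out.toByteArray();
    }

    static List<Integer> decode(byte[] bytes) {
        List<Integer> docIDs = new ArrayList<>();
        int n = 0, prev = 0;
        for (byte b : bytes) {
            if ((b & 0x80) == 0) n = n * 128 + b;
            else {                           // flag set: number is complete
                n = n * 128 + (b & 0x7F);
                prev += n;                   // undo the delta
                docIDs.add(prev);
                n = 0;
            }
        }
        return docIDs;
    }

    public static void main(String[] args) {
        int[] ids = {18, 115, 173, 213, 777, 778, 804};  // postings for "expanding"
        byte[] packed = encodeGaps(ids);
        System.out.println(packed.length + " bytes instead of " + ids.length * 4);
        System.out.println(decode(packed));
    }
}
```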

Updating index

How do we deal with dynamically updated indexes? It can be very expensive to add or change even a single document: the addition of new terms means we have to update the term -> postings mappings.

We can simply reindex everything at some frequency, such as every day. jguru took about 15-20 minutes to do a full reindex with Lucene. We used dynamic updates, however, upon each new forum post and the like.

Alternatively, we create an auxiliary index for new entries. Queries combine the results of looking in both indexes. When the auxiliary index gets big enough, or when we make an explicit compact call, the engine merges the two indexes. Deletions are simply filtered out of search results (just set a bit to mark the document deleted). To do an update, we add the document to the delete list and then re-insert it (it gets a new docID).
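A minimal sketch of that auxiliary-index idea, assuming a hypothetical engine where both indexes are plain maps and deletions are a shared docID set (all names are illustrative):

```java
import java.util.*;

// Query-time view over a main index plus an auxiliary index for fresh docs,
// with a deleted-docID set consulted before returning results.
public class DualIndex {
    static Map<String, List<Integer>> mainIndex = new HashMap<>();
    static Map<String, List<Integer>> auxIndex = new HashMap<>();
    static Set<Integer> deleted = new HashSet<>();

    // combine postings from both indexes, skipping tombstoned docs
    static List<Integer> postings(String term) {
        List<Integer> out = new ArrayList<>();
        for (int d : mainIndex.getOrDefault(term, List.of()))
            if (!deleted.contains(d)) out.add(d);
        for (int d : auxIndex.getOrDefault(term, List.of()))
            if (!deleted.contains(d)) out.add(d);
        return out;
    }

    public static void main(String[] args) {
        mainIndex.put("jobs", List.of(1, 2));
        auxIndex.put("jobs", List.of(7));  // new forum post indexed on the fly
        deleted.add(2);                    // doc 2 updated: old copy tombstoned
        System.out.println(postings("jobs"));  // [1, 7]
    }
}
```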

Boolean queries

In principle, we get the postings list for each term of an AND query and then intersect the lists.

To make this more efficient, start by intersecting the two smallest postings lists; that way no later intersection can be bigger than the smallest postings list. We proceed in order of increasing document frequency (i.e., from the smallest postings list to the largest).

For "(Steve OR Bob) AND (cash OR now) AND no", we can conservatively estimate the size of each OR's union and then figure out which AND to do first.

To do AND of words:

    // Get postings for query terms
    List<Term> queryterms = new ArrayList<Term>();
    for (String w : uniquewords) {
        queryterms.add(index.get(w));
    }

    // Start with the smallest postings list so no intersection can grow
    queryterms.sort(Comparator.comparingInt((Term t) -> docids(t).size()));

    // Do an intersection
    Term first = queryterms.get(0);
    Set<Integer> intersection = docids(first);
    for (int i = 1; i < queryterms.size(); i++) {
        Term t = queryterms.get(i);
        intersection.retainAll(docids(t));
    }