Assignments‎ > ‎

Search Engine Part I

Due Wed, Nov 21, 2012

Goal


Your goal in this project is to learn how to build an efficient index for a collection of files that can answer questions about term and document frequency. Attached, you will find interface Indexer.java for which you must provide an implementation called HashIndex. My primary data structure is a HashMap of String->Term objects:

public class Term {
    String term;

    /** Term count across all documents */
    int countInCorpus;

    /** Term is in how many documents? */
    int ndocs = 0;

    /** Fast to insert, O(lgn) to search to see if doc has term.
     *  docID -> Posting.  Sorted by docID.
     *   
     *  (Using PriorityQueue with linear lookup, bumped us from
     *   19s using hash as index + treemap for postings to big 
     *   number. killed at 1min25s.)
     *   
     *   using hashmap dropped time from 19s to 15s 
     */  
    Map<Integer, Posting> postings;
}

and a Posting object is just the document ID and term count within that document:

public static class Posting {
    int docID;
    int countInDoc;
}


You should also maintain a document index so that you can quickly access the file name and word count of a document ID.

Map<Integer, Document> docs = new HashMap<Integer, Document>();

with

public class Document {
    String filename;
    /** How many terms in doc > length 1? */
    int nterms;
}

You must implement all of the methods from interface Indexer as described in the Java doc comments and also make sure to provide a default constructor, HashIndex(), for your implementation object. That is the constructor I will use for my unit tests, as you see in the attached unit test file.

Don't forget to normalize the case of all of the words, strip everything except a-z, and drop words less than two characters such as 'I' and 'a'.

Resources

I have attached some unit tests to help you get started and so the we are all on the same page. I've also attached to the interface you must implement.

The following function is much much faster than doing a split(" ") in Java to split apart strings into words. I suggest using this function.

public String[] split(String s, char separator) {
    List<String> words = new ArrayList<String>();
    int i=0;
    int n = s.length();
    while ( i<n) {
        int c = s.charAt(i);
        if ( c==separator ) { 
            i++;
            continue;
        }   
        StringBuilder buf = new StringBuilder();
        c = s.charAt(i);
        while ( c!=separator && i<n) {
            buf.append((char)c);
            i++;
            if ( i<n ) c = s.charAt(i);
        }   
        words.add(buf.toString());
    }
    String[] a = new String[words.size()];
    return words.toArray(a);
}   

boolean[] allow = new boolean[128];
for (int i='a'; i<='z'; i++) allow[i] = true;
allow[' '] = true;
...
text = text.toLowerCase();
text = filter(text, allow);
String[] words = split(text, ' ');

public static String filter(String s, boolean[] allow) {
int n = s.length();
StringBuilder buf = new StringBuilder(n);
char prev = 0;
for (int i=0; i<n; i++) {
char c = s.charAt(i);
if ( allow[c] ) {
buf.append(c);
}
else {
buf.append(' ');
}
prev = c;
}
return buf.toString();
}

Submission

You will create a jar file called index.jar containing *.class files and place it in a directory called index dist under your cs680 dir:

https://www/svn/userid/cs680/index/dist/index.jar

Pur your source Java code in index/src:

https://www/svn/userid/cs680/index/src/...

To jar your stuff up, you will "cd" to the directory containing your source code and create the jar in the index dir:


cd ~/cs680/index/src

jar cvf ~/cs680/index/dist/index.jar *.class

cd ~/cs680/index/dist

svn add index.jar

svn commit

All classes must be in the default package!

To learn more about submitting your project with svn, see Resources.

You must submit your source code for credit.

Grading

I will run unit tests that extend the attached BabyTestIndex.java file.

You may discuss this project in its generality with anybody you want and may look at any code on the internet except for a classmate's code. You should physically code this project completely yourself but can use all the help you find other than cutting-n-pasting or looking at code from a classmate or other Human being.

I will deduct 10% if your program is not executable exactly in the fashion mentioned in the project; that is, class name, methods, lack-of-package, and jar must be exactly right. For you PC folks, note that case is significant for class names and file names on unix! All projects must run properly under linux at amazon.

ċ
BabyTestIndex.java
(3k)
Terence Parr,
Nov 13, 2012, 1:12 PM
ċ
Indexer.java
(2k)
Terence Parr,
Nov 13, 2012, 12:49 PM
Comments