Your goal in this project is to learn how to build an efficient index for a collection of files that can answer questions about term and document frequency. Attached, you will find interface Indexer.java for which you must provide an implementation called HashIndex. My primary data structure is a HashMap of String->Term objects:
and a Posting object is just the document ID and term count within that document:
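As a rough sketch of how the Term and Posting objects described above might fit together: the field names and the helper methods addOccurrence, documentFrequency, and termFrequency are assumptions made for illustration, not part of the attached interface or code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: all names below are illustrative assumptions.
class Posting {
    final int docId; // ID of the document containing the term
    int count;       // occurrences of the term within that document

    Posting(int docId) {
        this.docId = docId;
        this.count = 0;
    }
}

class Term {
    final String text;
    // One posting per document, keyed by document ID for constant-time updates.
    final Map<Integer, Posting> postings = new LinkedHashMap<Integer, Posting>();

    Term(String text) {
        this.text = text;
    }

    // Record one more occurrence of this term in the given document.
    void addOccurrence(int docId) {
        Posting p = postings.get(docId);
        if (p == null) {
            p = new Posting(docId);
            postings.put(docId, p);
        }
        p.count++;
    }

    // Number of distinct documents that contain the term.
    int documentFrequency() {
        return postings.size();
    }

    // Occurrences of the term in one document, or 0 if it never appears there.
    int termFrequency(int docId) {
        Posting p = postings.get(docId);
        return p == null ? 0 : p.count;
    }
}
```

Keying the postings by document ID is one possible design choice; a plain list of Posting objects would also work if documents are always indexed one at a time.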
You should also maintain a document index so that you can quickly access the file name and word count of a document ID.
Map<Integer, Document> docs = new HashMap<Integer, Document>();
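A minimal sketch of what such a document index could look like, assuming a Document object that holds just the file name and word count. The DocumentIndex wrapper class and the method names register, get, and size are hypothetical and only illustrate the lookup; in HashIndex the map would more likely be a field of the index itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: class layout and method names are assumptions.
class Document {
    final int id;          // document ID used in postings
    final String fileName; // file the document was read from
    int wordCount;         // number of indexed words in the file

    Document(int id, String fileName) {
        this.id = id;
        this.fileName = fileName;
        this.wordCount = 0;
    }
}

class DocumentIndex {
    private final Map<Integer, Document> docs = new HashMap<Integer, Document>();

    // Assign the next free ID to a file and remember it.
    Document register(String fileName) {
        Document d = new Document(docs.size(), fileName);
        docs.put(d.id, d);
        return d;
    }

    // Look up a document by ID; returns null for an unknown ID.
    Document get(int docId) {
        return docs.get(docId);
    }

    int size() {
        return docs.size();
    }
}
```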
You must implement all of the methods from interface Indexer as described in the Java doc comments and also make sure to provide a default constructor, HashIndex(), for your implementation object. That is the constructor I will use for my unit tests, as you see in the attached unit test file.
Don't forget to normalize the case of all of the words, strip everything except a-z, and drop words shorter than two characters such as 'I' and 'a'.
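The normalization rules above could be sketched as follows. This assumes that characters outside a-z are deleted from within a word rather than treated as word separators (so "Don't" becomes "dont"); the class name WordNormalizer and the method normalize are hypothetical.

```java
import java.util.Locale;

// Hypothetical helper: names and the exact stripping policy are assumptions.
class WordNormalizer {
    // Returns the normalized word, or null if the word should be dropped.
    static String normalize(String raw) {
        String lower = raw.toLowerCase(Locale.ROOT);
        StringBuilder sb = new StringBuilder(lower.length());
        for (int i = 0; i < lower.length(); i++) {
            char c = lower.charAt(i);
            if (c >= 'a' && c <= 'z') {
                sb.append(c); // keep only the letters a-z
            }
        }
        // Drop words with fewer than two remaining characters.
        return sb.length() < 2 ? null : sb.toString();
    }
}
```

Under these assumptions, normalize("Hello,") yields "hello", while normalize("I") and normalize("a") both yield null.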
I have attached some unit tests to help you get started and so that we are all on the same page. I've also attached the interface you must implement.
The following function is much, much faster than doing a split(" ") in Java to split strings into words. I suggest using this function.
You will create a jar file called index.jar containing *.class files and place it in a directory called index dist under your cs680 dir:
Put your Java source code in index/src:
To jar your stuff up, you will "cd" to the directory containing your source code and create the jar in the index dir:
All classes must be in the default package!
To learn more about submitting your project with svn, see Resources.
You must submit your source code for credit.
I will run unit tests that extend the attached BabyTestIndex.java file.
You may discuss this project in general terms with anybody you want and may look at any code on the internet except for a classmate's code. You should physically code this project completely yourself but can use all the help you find other than cutting-n-pasting or looking at code from a classmate or other human being.
I will deduct 10% if your program is not executable exactly in the fashion mentioned in the project; that is, class name, methods, lack-of-package, and jar must be exactly right. For you PC folks, note that case is significant for class names and file names on Unix! All projects must run properly under Linux at Amazon.