Assignments‎ > ‎

Search Engine Part II

Due Sunday, Dec 2, 2012 at midnight CA time


Goal

Your goal in this project is to extend the work you did in the first part to support conjunction queries. Given string "storm gulf winds coast", query() to return the list of document IDs that contain each of those words. The list should be sorted by highest to lowest document score, where the score is the sum of the individual query term TFIDF weights:

/** Return ordered list of document IDs of how search results
*  should be displayed, highest TFIDF-based score first.
*  Perform an AND (conjunction) of all query terms.
*
*  Find all docs with all terms in query then compute TFIDF scores
*  for those words and sum them up per document as the doc score.
*
* @param queryString The string with words to look up; all words
*                    must be present in a doc to have it returned.
* @return list of doc IDs ordered from highest to lowest TFIDF doc score
*/
List<Integer> query(String queryString);

For example, here's a simple unit test:

@Test public void testQuery() {
writeFile(tmpdir, "a.txt", "I think death is the most wonderful " +
"invention of life. It purges the system of these old " +
"models that are obsolete.");
writeFile(tmpdir, "b.txt", "I would trade all my technology for an " +
"afternoon with Socrates.");
writeFile(tmpdir, "c.txt", "What a computer is to me is the most " +
"remarkable tool that we have ever come up with. It is the " +
"equivalent of a bicycle for our minds.");
Indexer index = new HashIndex();
boolean error = false;
try {
index.addAll(tmpdir);
}
catch (IOException ioe) {
error = true;
ioe.printStackTrace(System.err);
}
assertFalse(error);
assertEquals("[3, 1]", index.query("is").toString());
}

My solution gets the following document scores and ordering:

(c.txt, 0.04678443555094205)
(a.txt, 0.02027325540540822)

When I perform query("storm gulf winds coast"), I get the following documents back.

"storm gulf winds coast" unsorted docs: [7, 100, 579, 120, 3161, 146, 2605, 172, 160, 2842]
Sorted:
(100132newsML.xml.txt, 0.4058283212626737)
(100109newsML.xml.txt, 0.39978073466645836)
(100145newsML.xml.txt, 0.38231975874849866)
(100090newsML.xml.txt, 0.3516083455851565)
(100006newsML.xml.txt, 0.33938015153546197)
(100156newsML.xml.txt, 0.3266469399820718)
(100526newsML.xml.txt, 0.31310450242340365)
(102584newsML.xml.txt, 0.21987262846607697)
(102875newsML.xml.txt, 0.19014300802269024)
(102369newsML.xml.txt, 0.12531335790920414)

Do not generate any output (no debugging or otherwise). Your query() method should simply return the list of document IDs in the right order.

So you can test your TF IDF scores, here's my scoring for document 7:

/Users/parrt/courses/CS680/data/articles/100006newsML.xml.txt 
0.15464 hurricane
0.15372 storm
0.13462 fla
0.13445 center
0.10022 josephine
0.09977 tropical
0.08523 winds
0.08480 coast
0.08371 kph
0.07912 warning
0.06981 mph
0.06797 anclote
0.05993 keys
0.05301 forecasters
0.05155 apalachicola
0.04910 moving
0.04614 effect
0.03991 rapidly
0.03922 northeast
0.03702 near
0.03702 jospehine
0.03702 apalachicaola
0.03518 km
0.03355 south
0.03322 miles
0.03221 walton
0.03094 northward
0.03094 inlet
0.02996 canaveral
0.02917 outward
0.02849 meters
0.02849 longitude
0.02849 latitude
0.02790 landfall
0.02790 cm
0.02692 rainfall
0.02692 inches
0.02612 onshore
0.02577 prompting
0.02551 west
0.02545 venice
0.02545 fort
0.02515 beach
0.02486 warnings
0.02460 ashore
0.02443 east
0.02388 northeastern
0.02388 cape
0.02367 flooding
0.02327 coastal
0.02308 path
0.02225 sustained
0.02211 southwest
0.02074 amounts
0.01995 river
0.01987 surge
0.01987 edt
0.01973 national
0.01953 extended
0.01921 evening
0.01907 feet
0.01892 toward
0.01892 cross
0.01871 maximum
0.01865 track
0.01821 florida
0.01784 or
0.01759 cause
0.01641 along
0.01610 mexico
0.01564 gulf
0.01476 post
0.01435 came
0.01372 force
0.01372 area
0.01355 continued
0.01353 least
0.01327 north
0.01315 become
0.01279 were
0.01229 possible
0.01186 little
0.01148 gmt
0.01090 much
0.01087 where
0.01055 strong
0.01006 was
0.00952 higher
0.00930 this
0.00768 from
0.00751 could
0.00733 to
0.00657 expected
0.00630 more
0.00611 about
0.00610 usa
0.00553 the
0.00541 and
0.00502 of
0.00454 up
0.00451 monday
0.00450 at
0.00410 will
0.00307 for
0.00273 said
0.00269 with
0.00262 in
0.00247 is
0.00242 by
0.00231 it
0.00088 on

Submission

You will create a jar file called index2.jar containing *.class files and place it in a directory called index2/dist under your cs680 dir:

https://www/svn/userid/cs680/index2/dist/index2.jar

Pur your source Java code in index/src:

https://www/svn/userid/cs680/index2/src/...

To jar your stuff up, you will "cd" to the directory containing your source code and create the jar in the index2 dir:


cd ~/cs680/index2/src

jar cvf ~/cs680/index2/dist/index2.jar *.class

cd ~/cs680/index2/dist

svn add index2.jar

svn commit

All classes must be in the default package!

To learn more about submitting your project with svn, see Resources.

You must submit your source code for credit.

Grading

I will run unit tests that extend the BabyTestIndex.java file again but will also test your code on the articles directory.

You may discuss this project in its generality with anybody you want and may look at any code on the internet except for a classmate's code. You should physically code this project completely yourself but can use all the help you find other than cutting-n-pasting or looking at code from a classmate or other Human being.

I will deduct 10% if your program is not executable exactly in the fashion mentioned in the project; that is, class name, methods, lack-of-package, and jar must be exactly right. For you PC folks, note that case is significant for class names and file names on unix! All projects must run properly under linux at amazon.

ċ
Indexer.java
(3k)
Terence Parr,
Nov 25, 2012, 3:06 PM
Comments