New assignments will be added here each week.
Assignment 1 -- Due 19 Jan 04
1. Read:
a. Chapter 1 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
b. Grefenstette and Tapanainen, "What is a Word? What is a Sentence?" (1994)
2. Send me an email with the following information:
Your email address
Your status (undergrad, grad, which program you are in)
What prior programming experience do you have?
What would you like to get from this course?
Do you have a personal computer? If so, what kind?
3. Download Perl and install it on your machine.
Assignment 2 -- Due 26 Jan 04
1. Read:
a. Chapter 2 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
b. Steve Lawrence and C. Lee Giles (NEC Research Institute), "Searching the World Wide Web", from Science magazine
2. Write a Perl program to find the words in a text file, count them, and
display a frequency list for the words in the text. Your program
should use a subroutine that accepts a string as input and returns
a list of the words found. Try to make your function a good one,
not just the simplest thing that works.
Here is some Test Data that could be used to test your program.
3. Find at least one example of some string that would be problematic
for a word finding program, e.g. "4x4" or "$1million".
(Send me an email with your example(s)).
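As a starting point for the word-frequency program, here is a minimal sketch. The subroutine names (find_words, count_words, print_frequencies) and the definition of a "word" are assumptions made for illustration, not requirements of the assignment. With this pattern "4x4" survives as one word, but "$1million" loses its dollar sign, which is exactly the kind of problem item 3 asks about.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the list of words found in a string.
# Assumption: a "word" is a run of letters or digits, optionally joined
# by internal apostrophes or hyphens; everything is lowercased.
sub find_words {
    my ($text) = @_;
    my @words = map { lc($_) } ($text =~ /[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/g);
    return @words;
}

# Count how often each word occurs in a list of lines.
sub count_words {
    my (@lines) = @_;
    my %freq;
    for my $line (@lines) {
        $freq{$_}++ for find_words($line);
    }
    return %freq;
}

# Print the frequency list, most frequent first, ties in alphabetical order.
sub print_frequencies {
    my (%freq) = @_;
    for my $word (sort { $freq{$b} <=> $freq{$a} || $a cmp $b } keys %freq) {
        printf "%6d  %s\n", $freq{$word}, $word;
    }
}

# Process the files named on the command line, if any.
if (@ARGV) {
    my %freq = count_words(<>);
    print_frequencies(%freq);
}
```

It could be run as "perl GVW_HW2.pl testdata.txt", where testdata.txt is a placeholder for whatever text file you are testing with.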
Assignment 3 -- Due 10 Feb 04
Each time I make a programming assignment, everyone in
the class will be turning in a program. In order to keep
track of each student's programs, I would like you to
adopt the following naming convention. Please make the
name of each program start with your initials followed
by _HWn.pl (where n is the number of the assignment
to which you are responding). Thus, for assignment 2,
my program would be called GVW_HW2.pl. If you need to
turn in more than one program for a single assignment,
place A, B, ... after the assignment number (e.g.
Refine your word finding program. This time, you should
make a subroutine called TokenizeWords that takes a
text string as an argument and returns an array of the
words in the string. Your program should work with
the driver program provided HERE. The text that I used
for testing your previous assignments and discussed in
class is available HERE.
Make a subroutine called IndexableWords that takes
a list of words (from TokenizeWords) and post-processes
them into a list of words for indexing. It should work
with the same driver program as above.
Note: We will talk about this assignment in class.
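The two subroutines might be shaped roughly as follows. This is only a sketch: the driver program is not reproduced here, so the calling convention is assumed to be a plain list in and a plain list out, and the post-processing rules (lowercasing, a small stop list, dropping bare numbers) are illustrative guesses rather than the rules discussed in class.

```perl
use strict;
use warnings;

# Hypothetical stop list, for illustration only.
my %STOP = map { $_ => 1 } qw(a an and of the to in is);

# Takes a text string and returns an array of the words in the string.
sub TokenizeWords {
    my ($text) = @_;
    return () unless defined $text;
    my @words = ($text =~ /[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/g);
    return @words;
}

# Takes a list of words (from TokenizeWords) and returns the words to index:
# lowercased, with stop words and bare numbers dropped (assumed rules).
sub IndexableWords {
    my (@words) = @_;
    my @indexable = grep { !$STOP{$_} && !/^\d+$/ } map { lc($_) } @words;
    return @indexable;
}

my @tokens  = TokenizeWords("The index of the 2 files is built in Perl.");
my @indexed = IndexableWords(@tokens);
print "tokens:  @tokens\n";
print "indexed: @indexed\n";
```

Keeping tokenizing and post-processing in separate subroutines means the rules for what gets indexed can change without touching the code that finds word boundaries.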
Assignment 4 -- Due 17 Feb 04
Chapter 4 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
Write a program that will process a file. Your program should:
a. Identify documents (document boundaries)
b. Identify metadata like title and date
c. Identify the text body of the document
Your output does not need to be elaborate. This will become a part
of your indexer where the output will be the index files.
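The three steps above can be sketched as follows. The file format is not specified here, so the markup is an assumption: each document is taken to be wrapped in <DOC> ... </DOC>, with <TITLE>, <DATE> and <TEXT> elements inside. The subroutine name ParseDocuments is also hypothetical; adapt both to the actual data.

```perl
use strict;
use warnings;

# Split the contents of a file into documents and pull out the fields.
# Assumed (hypothetical) markup: <DOC> ... </DOC> around each document,
# with <TITLE>, <DATE> and <TEXT> elements inside it.
sub ParseDocuments {
    my ($content) = @_;
    my @docs;
    while ($content =~ m{<DOC>(.*?)</DOC>}gs) {
        my $raw = $1;
        my %doc;
        for my $field (qw(TITLE DATE TEXT)) {
            my ($value) = $raw =~ m{<$field>\s*(.*?)\s*</$field>}s;
            # A missing field becomes an empty string.
            $doc{ lc($field) } = defined $value ? $value : '';
        }
        push @docs, \%doc;
    }
    return @docs;
}

my $sample = <<'END';
<DOC>
<TITLE>First report</TITLE>
<DATE>17 Feb 04</DATE>
<TEXT>
Body of the first document.
</TEXT>
</DOC>
<DOC>
<TITLE>Second report</TITLE>
<TEXT>Body of the second document.</TEXT>
</DOC>
END

for my $doc (ParseDocuments($sample)) {
    print "title: $doc->{title}\n";
    print "date:  $doc->{date}\n";
    print "body:  $doc->{text}\n\n";
}
```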
Assignment 5 -- Due 22 Feb 04
Chapter 8 of Modern Information Retrieval by Baeza-Yates and Ribeiro-Neto
In assignments 2 and 3, you wrote a subroutine to find words in a text.
In assignment 4, you wrote a program to handle multiple documents. For
this assignment, you should merge the two. Your program should accept a
list of files, each of which may contain multiple documents. It should
loop through the documents and extract relevant metadata (at least the
title) and identify the body of the text. For each document, your program
should print out the title, the total number of words (tokens) in the
text and the number of distinct words (types) in the text.
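The counting step can be sketched as below. This covers only the per-document report, not the loop over files. The tokenizer is a simplified stand-in for your own TokenizeWords, the documents are hypothetical sample data standing in for the output of your Assignment 4 code, and folding case before counting types is an assumed choice.

```perl
use strict;
use warnings;

# Simplified stand-in for the tokenizer from Assignments 2 and 3.
sub TokenizeWords {
    my ($text) = @_;
    my @words = ($text =~ /[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*/g);
    return @words;
}

# Return (tokens, types) for a body of text. Case is folded so that
# "The" and "the" count as one type (an assumed choice).
sub CountTokensAndTypes {
    my ($body) = @_;
    my @tokens = map { lc($_) } TokenizeWords($body);
    my %seen;
    $seen{$_}++ for @tokens;
    return (scalar @tokens, scalar keys %seen);
}

# Hypothetical documents, standing in for the output of the Assignment 4 code.
my @docs = (
    { title => 'First report',  text => 'The cat saw the other cat.' },
    { title => 'Second report', text => 'To be or not to be.' },
);

for my $doc (@docs) {
    my ($tokens, $types) = CountTokensAndTypes($doc->{text});
    print "$doc->{title}: $tokens tokens, $types types\n";
}
```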
Assignment 6 -- Due 23 March 04
Program: Enhance your program from Assignment 5 to build your indexer.
You should either use the three file index structure discussed in
class or submit a description of the structure that you are using.
You will need to keep track of things like the byte position of the
documents to support retrieval later.
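One way to record byte positions is to call tell() before reading each line, as in the sketch below. The three-file structure from class is not described here, so this example only builds the data in memory; writing it out to index files is left open. The document markers (<DOC> and </DOC> on their own lines) and the name BuildIndex are assumptions.

```perl
use strict;
use warnings;

# Build an in-memory index from an open filehandle.
# Assumed (hypothetical) format: each document starts with a line "<DOC>"
# and ends with a line "</DOC>".
# Returns a reference to the list of document start offsets (in bytes) and
# a reference to a hash: word => { document number => occurrence count }.
sub BuildIndex {
    my ($fh) = @_;
    my (@offsets, %postings);
    my $doc_id;
    my $pos = tell($fh);            # byte position of the line about to be read
    while (my $line = <$fh>) {
        if ($line =~ /^<DOC>/) {
            push @offsets, $pos;    # remember where this document starts
            $doc_id = $#offsets;
        }
        elsif ($line =~ m{^</DOC>}) {
            undef $doc_id;
        }
        elsif (defined $doc_id) {
            $postings{ lc($_) }{$doc_id}++ for ($line =~ /[A-Za-z0-9]+/g);
        }
        $pos = tell($fh);
    }
    return (\@offsets, \%postings);
}

# Demonstration on an in-memory file, so no sample file is needed.
my $sample = "<DOC>\nperl index\n</DOC>\n<DOC>\nperl search\n</DOC>\n";
open my $fh, '<', \$sample or die "cannot open sample: $!";
my ($offsets, $postings) = BuildIndex($fh);
close $fh;

print "document $_ starts at byte $offsets->[$_]\n" for 0 .. $#$offsets;
for my $word (sort keys %$postings) {
    my @docs = sort { $a <=> $b } keys %{ $postings->{$word} };
    print "$word: @docs\n";
}
```

With the start offset stored, a later retrieval step can seek() straight to a document instead of rereading the whole file.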
Assignment 7 -- Due 30 March 04
Write a preliminary version of your search engine.
Use the index files that you generated last time to
find files that contain words from a user query.
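A preliminary lookup could be as simple as the sketch below. The index here is a hypothetical in-memory hash; a real version would load it from the index files generated in Assignment 6. Ranking documents by the number of distinct query words they contain is an assumed choice, not something the assignment prescribes.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: word => { document number => count }.
# A real version would load this from the index files of Assignment 6.
my %postings = (
    perl   => { 0 => 2, 1 => 1 },
    index  => { 0 => 1 },
    search => { 1 => 3, 2 => 1 },
);

# Return the documents that contain at least one query word, ordered by
# how many distinct query words they contain (an assumed ranking);
# ties are broken by document number.
sub Search {
    my ($index, $query) = @_;
    my %query_words = map { lc($_) => 1 } ($query =~ /[A-Za-z0-9]+/g);
    my %matched;
    for my $word (keys %query_words) {
        next unless exists $index->{$word};
        $matched{$_}++ for keys %{ $index->{$word} };
    }
    my @hits = sort { $matched{$b} <=> $matched{$a} || $a <=> $b } keys %matched;
    return @hits;
}

my @hits = Search(\%postings, 'Perl search');
print "matching documents: @hits\n";
```

The query is tokenized and lowercased before the lookup. Whatever rules you used at indexing time should be applied to the query as well, or words that were normalized in the index will never match.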