

README

This is a set of Perl scripts, subroutines, and sample data
demonstrating rudimentary term frequency-inverse document frequency
(TFIDF) techniques. Specifically, the scripts in this distribution
search, classify, and compare small sets of plain text files using
term weighting algorithms.
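
For readers new to term weighting, the arithmetic can be sketched in a
few lines. The sketch below is Python, not the Perl found in
scripts/subroutines.pl, and it assumes one common formulation of TFIDF:
(count of the term in the document / words in the document) multiplied
by log(documents in the corpus / documents containing the term). The
function name tfidf and the three tiny "documents" are invented purely
for illustration.

```python
import math

def tfidf(term, document, corpus):
    """Score one term in one document (assumed formulation, see above).

    document is a list of lowercase words; corpus is a list of documents.
    """
    count = document.count(term)
    if count == 0:
        return 0.0  # a term absent from the document carries no weight
    tf = count / len(document)                    # term frequency
    df = sum(1 for doc in corpus if term in doc)  # document frequency
    idf = math.log(len(corpus) / df)              # inverse document frequency
    return tf * idf

# hypothetical sample data, not the *.txt files in this distribution
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs are pets".split(),
]

for word in ("cat", "sat", "dog"):
    print(word, round(tfidf(word, corpus[0], corpus), 4))
```

In the first document, "cat" outscores "sat" because "cat" appears in
only one of the three documents while "sat" appears in two. A word
found in every document would score zero, which is the point of the
inverse document frequency factor.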

The distribution includes the following files:

  o part*.txt - plain text blog postings describing TFIDF
  
  o scripts/*.txt - various plain text files; sample data
  
  o scripts/search.pl - does a simple search against the *.txt files
	and returns them in relevancy ranked order
  
  o scripts/classify.pl - extracts and lists the most statistically
	relevant words from each *.txt file
  
  o scripts/compare.pl - creates a matrix of the *.txt files listing
	how each document compares in similarity to every other
	document
  
  o scripts/subroutines.pl - a library of functions used in all the
	scripts
  
  o scripts/stopwords.inc - a stop word list
  
  o scripts/ideas.inc - a list of "big names" and "great ideas" used
	for finding more documents like this one
  
  o scripts/ideas.pl - searches a corpus and ranks the results
	according to "great ideas"
  
  o README - this file
  
  o LICENSE - a copy of the GNU General Public License under which
	these files are distributed
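
To give a feel for the kind of matrix a comparison script can produce,
here is a minimal Python sketch. It assumes each document has already
been reduced to a set of term weights and it uses cosine similarity,
which is one common way to compare such sets. That choice is an
assumption made for illustration and is not necessarily what
scripts/compare.pl does. The file names and weights are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (term -> weight)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# hypothetical term weights for three hypothetical files
weights = {
    "a.txt": {"tfidf": 0.5, "library": 0.2},
    "b.txt": {"tfidf": 0.4, "library": 0.1, "perl": 0.3},
    "c.txt": {"music": 0.6},
}

# compare every document with every other document
names = sorted(weights)
matrix = {
    x: {y: cosine_similarity(weights[x], weights[y]) for y in names}
    for x in names
}

for x in names:
    print(x, " ".join(f"{matrix[x][y]:.2f}" for y in names))
```

With non-negative weights, scores fall between 0 (no terms in common)
and 1 (identical proportions of terms), and each document scores 1
against itself.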

You are encouraged to swap out the *.txt files for your own in
order to see for yourself what TFIDF can do for you, but be forewarned:
I doubt the system will work very well with a sample size of more than a
dozen files. More information about these files may be found in a
set of three blog postings:

  1. http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-i-for-librarians/
  2. http://infomotions.com/blog/2009/04/tfidf-in-libraries-part-ii-of-iii-for-programmers/
  3. http://infomotions.com/blog/2009/05/tfidf-in-libraries-part-iii-of-iii-for-thinkers/

Enjoy, learn, and pass along your knowledge to others.

P.S. I updated the compare subroutine in scripts/subroutines.pl to
remove a cosine function. This seems to resolve the "duplicate document"
problem currently mentioned in blog posting #3, but it also creates a
new problem. Previously similar documents are now almost opposites. I'm
stymied. "Thanks Allen!" (May 19, 2010)

-- 
Eric Lease Morgan
May 31, 2009