From "Chris Harris" <>
Subject indexing hash
Date Mon, 30 Jun 2008 18:28:59 GMT
Hey guys -


I'm considering using Nutch to do some indexing & searching over websites,
but unfortunately I'm running into some roadblocks in "moving" from my
current system to Nutch.


I've reviewed the parsing code and honestly I'm a bit confused by it, so I
was hoping I could solicit some expert advice!


I've crawled a lot of pages to date (50-60 million) through other means and
have done some other work "downstream" of that based upon the words & links
on those pages.  For example, I have a classifier which tags a document as
"sports" if it finds certain keywords in it.

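To make the classifier idea concrete, here is a minimal sketch in Python. It
is not our production code: the keyword list, the tokenizer and the function
names are all made up for illustration.

```python
import re

# Hypothetical keyword list; the real classifier's keywords are not shown here.
SPORTS_KEYWORDS = {"football", "baseball", "playoffs"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def classify(text, keywords=SPORTS_KEYWORDS, label="sports"):
    """Return the label if any keyword occurs among the tokens, else None."""
    tokens = set(tokenize(text))
    return label if tokens & keywords else None
```

Calling classify("The playoffs start tonight") tags the text as "sports",
while a text with none of the keywords gets no tag.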

This matters because some of my downstream processing depends on our
existing hash values for the tokens in each page.


Therefore, I have a few questions:


1.	Where is *the place* (or places) to put my current hash function,
such that Nutch will index the terms in a way that will match my existing
hash functions?

a.	How does the language/encoding detection play out here?  Are all
documents indexed according to their native encoding or are they converted
to some common denominator (UTF-8 for example)?

2.	If I wanted to add a content classifier to Nutch, what's the best
way to do this?  My current assessment is:

a.	Create a class which derives from BasicIndexingFilter to add a
"category" attribute to each page.
b.	Create a class which derives from BasicQueryFilter to facilitate
searching over this new category attribute.
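
To show why the encoding question in 1a matters for the hashes, here is a
small sketch in Python. MD5 is only a stand-in for our real hash function,
and token_hash is a hypothetical helper, but the effect is the same for any
hash computed over bytes: a non-ASCII token hashes differently depending on
the encoding it is in when it gets hashed.

```python
import hashlib


def token_hash(token, encoding):
    """Hash the bytes of a token under the given encoding (MD5 as a stand-in)."""
    return hashlib.md5(token.encode(encoding)).hexdigest()


# "caf\u00e9" is "cafe" with an acute accent on the final letter.
utf8_hash = token_hash("caf\u00e9", "utf-8")      # bytes: 63 61 66 c3 a9
latin1_hash = token_hash("caf\u00e9", "latin-1")  # bytes: 63 61 66 e9

# Pure-ASCII tokens encode to the same bytes under both encodings.
ascii_same = token_hash("cafe", "utf-8") == token_hash("cafe", "latin-1")
```

Here utf8_hash and latin1_hash differ while ascii_same is true, so whether
Nutch converts everything to one encoding before indexing decides whether
hashes of non-ASCII tokens can match our existing ones.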


Thanks so much for your help & the great work so far. I'm looking forward to
using this thing "for real" once I get past these issues, which feel like
they should be relatively minor!





