nutch-dev mailing list archives

From "Howie Wang" <>
Subject Ideas for enhancements
Date Sun, 19 Jun 2005 15:08:54 GMT

There is some functionality for Nutch that I've either implemented
or am planning to implement, and I'm curious whether other people are
interested, so that the changes could maybe get into the main line.

1. A String[] HitDetails.getValues(String field) method that
returns an array of the values. The current method only returns a
single string, but Lucene indexes can have multiple values
per field.
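
A minimal sketch of what I mean, not the actual Nutch class. HitDetailsSketch is a hypothetical stand-in, and I'm assuming the details are stored as parallel arrays of field names and values, with a repeated field name for each extra value:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for HitDetails (an assumption, not the real Nutch class).
// Field names and values are kept in parallel arrays; a field with several
// values simply appears several times.
final class HitDetailsSketch {
    private final String[] fields;
    private final String[] values;

    HitDetailsSketch(String[] fields, String[] values) {
        if (fields.length != values.length) {
            throw new IllegalArgumentException("fields and values must have the same length");
        }
        this.fields = fields.clone();
        this.values = values.clone();
    }

    // Single-value lookup: returns the first value for the field, or null if absent.
    String getValue(String field) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                return values[i];
            }
        }
        return null;
    }

    // Proposed lookup: returns every value stored for the field, in index order.
    // Returns an empty array when the field is absent.
    String[] getValues(String field) {
        List<String> matches = new ArrayList<>();
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(field)) {
                matches.add(values[i]);
            }
        }
        return matches.toArray(new String[0]);
    }
}
```

Returning an empty array rather than null for a missing field is my own choice here; it keeps callers from needing a null check before looping.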

2. In, put in a field (parentURL) for the URL of the page that
contains the link. Right now it seems we just have the links themselves
and we can't backtrack where they come from. Being able to backtrack
through the links is handy for doing something like categorization. For
example, you see that all the links are coming from a page about poodles,
so you might categorize the linked page as a poodle page. It might also
come in handy for doing something like a Google TrustRank scoring, where
you penalize certain sites if they're a known link farm, or boost them if
they're linked from some place respected like DMOZ.
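
To make the categorization idea concrete, here is a rough sketch. Everything in it is hypothetical (the ParentCategoryVote class, the guessCategory method, and the map of already-categorized pages); it only assumes that the parent URLs of a link are available, which is what the parentURL field would provide:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of Nutch): guess a page's category from the
// categories of the pages that link to it.
final class ParentCategoryVote {

    // parentUrls: URLs of the pages containing a link to the target page.
    // categoryOfPage: assumed lookup of already-categorized pages.
    // Returns the most common category among the parents, or null if no
    // parent has a known category. Ties go to the category seen first.
    static String guessCategory(List<String> parentUrls, Map<String, String> categoryOfPage) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String parent : parentUrls) {
            String category = categoryOfPage.get(parent);
            if (category != null) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }
}
```

A trust-style score could presumably be built the same way, by summing a per-parent weight instead of counting categories.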

3. Get sorting to work on multiple fields. Lucene already works on
multiple fields so it shouldn't be difficult to get this working. Just
change the places where it passes down String field so that it
accepts an array. The sort fields could be read from the query
string in order:
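
Separately from how the fields get parsed, the ordering itself can be sketched without Lucene. This is only an illustration under my own assumptions: MultiFieldSortSketch and byFields are hypothetical names, each hit is modeled as a map of field name to string value, and the sort fields are passed in directly as an array. In Lucene the equivalent would be a Sort built from several SortField objects.

```java
import java.util.Comparator;
import java.util.Map;

// Hypothetical illustration (not Nutch or Lucene code): order hits by several
// fields, using each later field only to break ties on the earlier ones.
final class MultiFieldSortSketch {

    // Builds a comparator from an ordered array of field names, the analogue
    // of passing String[] where a single String field is passed today.
    static Comparator<Map<String, String>> byFields(String[] sortFields) {
        return (a, b) -> {
            for (String field : sortFields) {
                int c = compareNullsLast(a.get(field), b.get(field));
                if (c != 0) {
                    return c;
                }
            }
            return 0; // equal on every sort field
        };
    }

    // Plain string comparison, with hits that lack the field sorted last.
    private static int compareNullsLast(String x, String y) {
        if (x == null && y == null) return 0;
        if (x == null) return 1;
        if (y == null) return -1;
        return x.compareTo(y);
    }
}
```

Sorting a list of hits is then hits.sort(MultiFieldSortSketch.byFields(new String[] {"site", "date"})), where the field names are again just examples.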


Is anybody interested in these things? It would be nice to get them
merged into the main code.

