lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From House Less <>
Subject Retrieving the term vectors of a document in Nutch
Date Mon, 08 Jun 2009 00:14:17 GMT

Hello everyone,

I am quite new to development with Nutch, so you must forgive my question if it is amateurish.

After some reading of Luke's source code, I found to my dismay that obtaining the TermFreqVector
of a document via the IndexReader resulted in no vectors at all. A mailing list entry found
via Google said that Nutch does not store the contents of a page in its Lucene indices. This
makes sense.

I then read the Nutch source code and figured out that one could use NutchBean to reconstruct
the parsed text of an indexed page.

However, this still left the nagging problem of retrieving the TermFreqVector for the parsed
text of a page. I tried MoreLikeThis to retrieve the set of terms but that did not work either;
it was simply empty. The source code to MoreLikeThis suggests certain assumptions made on
the Lucene indices being accessed.

At the end of the day, I simply decided to reconstruct the term frequency vector of a page
by referring to TermDocs in the IndexReader. This is not very efficient since I have to do
this for every page iterated over the Lucene document index.

I wonder whether it is possible to retrieve previously computed TermFreqVector[] of a document
in Nutch's Lucene indices? Could it be that this is not possible because Nutch does not store
the TermFreqVector[]? Your insights on the matter will help.



To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message