lucene-java-user mailing list archives

From "Ard Schrijvers" <>
Subject RE: Does Index have a Tokenizer Built into it
Date Fri, 13 Jul 2007 07:58:18 GMT

> I'm wondering if, after opening the index, I can retrieve the Tokens (not
> the terms) of a document, something akin to
> IndexReader.Document(n).getTokenizer().

It is not possible to get the original tokens of a document back from the index when you
haven't stored the document, because:

1) the analyzer might have removed stop words before indexing
2) the terms in the Lucene index may be stemmed words, synonyms, etc.
3) how would you expect things like spaces, commas, dots, etc. to be restored?

And I think what you want does not fit the way an inverted index works. When you do not store
the document, you always lose information about the document during indexing/analyzing.
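
To illustrate the point, here is a minimal sketch in plain Java. It is a toy inverted index,
not the Lucene API: the class name, the stop-word list and the method names are all
hypothetical, and stemming is left out to keep it short. It indexes one sentence as
term -> positions and then rebuilds the best term sequence those positions allow.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class LossyIndexSketch {
    // Hypothetical stop-word list for this sketch; not Lucene's default set.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "of"));

    // Toy analysis: split on non-letters, lowercase, drop stop words.
    // Positions still advance for dropped words, so gaps remain visible.
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> postings = new TreeMap<>();
        int position = 0;
        for (String raw : text.split("[^A-Za-z]+")) {
            if (raw.isEmpty()) {
                continue; // leading delimiter produces an empty first piece
            }
            String term = raw.toLowerCase();
            if (!STOP_WORDS.contains(term)) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(position);
            }
            position++;
        }
        return postings;
    }

    // Best-effort reconstruction: order the surviving terms by position.
    static String reconstruct(Map<String, List<Integer>> postings) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : postings.entrySet()) {
            for (int pos : entry.getValue()) {
                byPosition.put(pos, entry.getKey());
            }
        }
        return String.join(" ", byPosition.values());
    }

    public static void main(String[] args) {
        String original = "The index is a map of terms, not the Document.";
        System.out.println(reconstruct(index(original)));
    }
}
```

The rebuilt sequence has the right order, but the stop words, the capitalisation and the
punctuation are gone, and nothing in the postings says what they were.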

How many documents are you talking about? They must be either somewhere on the FS or accessible
over HTTP... when you need the document, why not just provide a link to the original location?

Regards Ard

> In summary:
> My current (too wasteful) implementation is this:
> StandardTokenizer(BufferedReader(IndexReader.Document(n).getField("text")))
> I'm wondering if Lucene has a more efficient way to retrieve the tokens
> of a document from an index, because it seems like it has information
> about every "term" already, since you can retrieve a TermPositions object.
> Thanks,
> --JP
