lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Keegan" <>
Subject Re: term counts during indexing
Date Thu, 30 Oct 2003 16:52:36 GMT
As I understand it, the field text is being tokenized by the analyzer when
IndexWriter.addDocument is called. At this point, the tokens are indexed
and/or stored. Would it be possible for 'addDocument' to save and make the
_actual_ counts of 'tokens stored' and 'tokens indexed' available in either
the Document or IndexWriter object? I guess I may be turning this into a
feature request :)

Also, I can't find this method from the code snippit provided by Gerret (I'm
using v1.2):
> String[] fieldTerms = doc.getValues(fieldName);


----- Original Message ----- 
From: "Gerret Apelt" <>
To: "Lucene Users List" <>
Sent: Wednesday, October 29, 2003 9:44 PM
Subject: Re: term counts during indexing

> Peter Keegan wrote:
> >Is there a simple and efficient way of determining the number of tokens
> >added
> >to a document after adding each field ('Document.add), as a result of the
> >actions
> >of the Analyzer, without having to re-parse the field
> Peter --
> you can ask the Document instance.
> Document doc = getDocumentInstanceFromSomewhere();
> int termCount = 0;
> Enumertion fields = doc.fields();
> while (fields.hasMoreElements()) {
>     Field field = (Field)fields.nextElement();
>     String fieldName =;
>     String[] fieldTerms = doc.getValues(fieldName);
>     termCount += fieldTerms.length;
> }
> System.out.println("The fields of the document together contain
> "+termCount+" terms.");
> Note that
> 1) I haven't tried to compile this code, so I'm not sure if it works
> 2) this will only work for those fields where field.isStored() == true.
> If the field isnt stored in the index, then you don't have a choice but
> to go back to the document.
> [not sure on the following, so please correct me if in error:] Remember
> that unStored fields are indexed, so you can query on them, but the
> field terms themselves are not stored in the index. Therefore you cannot
> count them by asking Lucene. A Lucene field instance also has no way to
> reference the source of the terms that are added to it. The field
> doesn't care where its terms came from. So if field.isStored() == false,
> then for that particular field Lucene cannot tell you how many terms are
> in it. You'll have to write your own code that analyzes the original
> data source in this case.
> >Alternatively, is there a way to determine the number of tokens added
> >adding the document to the index ('IndexWriter.addDocument')?
> >
> >
> Whether you want the termCount for a document before or after you add
> the document to the index doesn't matter, so the answer is "see above".
> cheers,
> Gerret
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message