lucene-java-user mailing list archives

From "Diego Ceccarelli (BLOOMBERG/ LONDON)" <dceccarel...@bloomberg.net>
Subject Re: Incorrect CollectionStatistics if IndexWriter.close is not called
Date Mon, 01 Mar 2021 19:46:28 GMT
I'm not sure that closing and reopening the IndexWriter will always work. I think the 'problem'
will go away only once the segment containing the deleted document is merged with another segment.
That might happen during close(), but it might also *not* happen (e.g., if you have only one
segment and you delete a document, closing and reopening probably won't fix it).
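
To make the lazy-delete behaviour concrete, here is a minimal toy model in plain Java. It is not Lucene code: the class LazyDeleteModel and its methods are hypothetical names invented for illustration, and it only mimics the idea that an update flags the old copy as deleted and appends a new copy, while the deleted slot keeps counting towards maxDoc until segments are merged.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Toy model of lazy deletes (hypothetical, NOT Lucene's implementation). */
final class LazyDeleteModel {

    /** A segment: document ids plus a bitmap marking which slots are live. */
    static final class Segment {
        final List<String> ids = new ArrayList<>();
        final BitSet live = new BitSet();

        void add(String id) {
            live.set(ids.size()); // new documents start out live
            ids.add(id);
        }

        int maxDoc() { return ids.size(); }          // every slot, deleted or not
        int numDocs() { return live.cardinality(); } // live slots only
    }

    final List<Segment> segments = new ArrayList<>();

    /** Writes the given ids as one new segment. */
    public void addSegment(String... ids) {
        Segment s = new Segment();
        for (String id : ids) {
            s.add(id);
        }
        segments.add(s);
    }

    /** Update = flag the old copy as deleted, then append the new copy. */
    public void update(String id) {
        for (Segment s : segments) {
            for (int i = 0; i < s.ids.size(); i++) {
                if (s.ids.get(i).equals(id)) {
                    s.live.clear(i); // lazy delete: the slot stays in the segment
                }
            }
        }
        addSegment(id);
    }

    /** Merge: copy only live documents into one segment, dropping deleted slots. */
    public void mergeAll() {
        Segment merged = new Segment();
        for (Segment s : segments) {
            for (int i = 0; i < s.maxDoc(); i++) {
                if (s.live.get(i)) {
                    merged.add(s.ids.get(i));
                }
            }
        }
        segments.clear();
        segments.add(merged);
    }

    public int maxDoc() {
        int total = 0;
        for (Segment s : segments) total += s.maxDoc();
        return total;
    }

    public int numDocs() {
        int total = 0;
        for (Segment s : segments) total += s.numDocs();
        return total;
    }

    public static void main(String[] args) {
        LazyDeleteModel index = new LazyDeleteModel();
        index.addSegment("a", "b");
        System.out.println("after add:    maxDoc=" + index.maxDoc() + " numDocs=" + index.numDocs());
        index.update("a");
        System.out.println("after update: maxDoc=" + index.maxDoc() + " numDocs=" + index.numDocs());
        index.mergeAll();
        System.out.println("after merge:  maxDoc=" + index.maxDoc() + " numDocs=" + index.numDocs());
    }
}
```

In this model the inflated count disappears only in mergeAll(); nothing about closing or reopening a writer changes it unless a merge happens to run.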

Can you describe the problem you are trying to solve? Why do you need such accuracy?
If this is for some type of scoring, the ranking shouldn't be affected by whether you have X or X-1
documents in the collection.
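
As a rough worked example of the X versus X-1 point, the sketch below compares a BM25-style idf value for a document count that is off by one. The formula ln(1 + (N - n + 0.5) / (n + 0.5)) is assumed here as the idf definition; the class name IdfSensitivity and the sample counts are hypothetical and chosen only for illustration.

```java
/** Sketch: how much an off-by-one document count shifts a BM25-style idf. */
final class IdfSensitivity {

    /**
     * BM25-style idf, assumed form: ln(1 + (N - n + 0.5) / (n + 0.5)),
     * where N is the document count and n the document frequency of the term.
     */
    public static double idf(long docCount, long docFreq) {
        return Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        // Hypothetical large collection: one extra (deleted but still counted) document.
        long docFreq = 1_000L;
        long docCount = 1_000_000L;
        double exact = idf(docCount, docFreq);
        double inflated = idf(docCount + 1, docFreq);
        System.out.printf("large index: idf=%.9f, inflated=%.9f, shift=%.9f%n",
                exact, inflated, inflated - exact);

        // Tiny index, like a two-document test case: the same off-by-one is visible.
        double tinyExact = idf(2L, 1L);
        double tinyInflated = idf(3L, 1L);
        System.out.printf("tiny index:  idf=%.9f, inflated=%.9f, shift=%.9f%n",
                tinyExact, tinyInflated, tinyInflated - tinyExact);
    }
}
```

Under this assumption the shift is on the order of 1/N, so it is negligible for a large collection but clearly visible in a two-document test index.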

Cheers,
diego

From: java-user@lucene.apache.org At: 03/01/21 16:23:48 To: Diego Ceccarelli (BLOOMBERG/ LONDON), java-user@lucene.apache.org
Subject: Re: Incorrect CollectionStatistics if IndexWriter.close is not called

Hi,

You're right: the documentation of Terms.getDocCount says that "this
measure does not take deleted documents into account".
So if we want correct counts and correct query scores, the IndexWriter
has to be closed after documents are deleted or updated, and a new one has
to be created when new documents arrive.

Thanks

On Sun, Feb 28, 2021 at 17:04, Diego Ceccarelli (BLOOMBERG/ LONDON)
<dceccarelli4@bloomberg.net> wrote:
>
> I *guess* it's because an update is implemented as removing and then
> reinserting the document. Deletes in Lucene are lazy: the deleted document is
> just flagged as deleted in a bitmap and is removed from the index only when
> segments are merged. Did you check the IndexSearcher.collectionStatistics
> documentation? It should mention something about that.
>
> Cheers,
> diego
>
>
> From: java-user@lucene.apache.org At: 02/28/21 11:09:52 To: java-user@lucene.apache.org
> Subject: Incorrect CollectionStatistics if IndexWriter.close is not called
>
> Hi,
>
> I don't understand if I'm doing something wrong or if it is the
> expected behaviour.
>
> My problem is that when a document is updated, collectionStatistics
> returns counts as if a new document had been added to the index, even after
> a call to IndexWriter.commit and to
> SearcherManager.maybeRefreshBlocking.
> If I call IndexWriter.close, the counts are correct again, but the
> documentation of IndexWriter.close says to try to reuse the
> IndexWriter, so I'm a bit confused.
>
> Ex:
> If I add two documents to an empty index,
>
> IndexSearcher.collectionStatistics("TEXT") returns
> "field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=5,sumDocFreq=5" ->
> OK
>
> Then I update one of the documents and call commit():
>
> IndexSearcher.collectionStatistics("TEXT") returns
> "field="TEXT",maxDoc=3,docCount=3,sumTotalTermFreq=9,sumDocFreq=9" ->
> NOK
>
> If I call close() now,
>
> IndexSearcher.collectionStatistics("TEXT") returns
> "field="TEXT",maxDoc=2,docCount=2,sumTotalTermFreq=6,sumDocFreq=6" ->
> OK
>
> Note that the counts are correct if the index contains only one document.
>
>
> I attached a test case.
>
> Am I doing something wrong somewhere?
>
>
> Julien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
