lucene-java-user mailing list archives

From Carsten Schnober <schno...@ids-mannheim.de>
Subject Re: Statically store sub-collections for search (faceted search?)
Date Mon, 15 Apr 2013 09:19:04 GMT
On 15.04.2013 10:42, Uwe Schindler wrote:

> Not every DocIdSet supports bits(). If it returns null, then bits are not supported.
> To enforce that a bitset is available, use CachingWrapperFilter (which internally uses a
> BitSet to cache).
> It might also happen that Filter.getDocIdSet() returns null, which means that no document
> matches the filter.

I've been using a ChainedFilter so far. I think this should also support
bits(), right?

> AcceptDocs in Lucene are generally all non-deleted documents. For your call to Filter.getDocIdSet
> you should therefore pass AtomicReader.getLiveDocs() and not Bits.MatchAllBits.

I see. As far as I understand the documentation, getLiveDocs() returns
null if there are no deleted documents and returns the Bits matching all
available (not deleted) documents otherwise:
"Returns the Bits representing live (not deleted) docs. A set bit
indicates the doc ID has not been deleted. If this method returns null
it means there are no deleted documents."
I understand that if there are no deleted documents, I need to replace
the result (null) with a Bits.MatchAllBits instance, right? If there are
deleted documents, however, I can pass on the result, which has all
available (not deleted) document bits set.

> You are somehow "misusing" acceptDocs and DocIdSet here, so you have to take care, semantics
> are different:
> - For acceptDocs "null" means "all documents allowed" -> no deleted documents
> - For DocIdSet "null" means "no documents matched"
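
A minimal sketch of the two null conventions quoted above. It uses java.util.BitSet as a stand-in for Lucene's Bits and DocIdSet, so the class and method names (NullSemantics, acceptDocsOrAll, matchesOrNone) are hypothetical and not part of the Lucene API:

```java
import java.util.BitSet;

public class NullSemantics {
    // acceptDocs convention: null means "all documents allowed".
    public static BitSet acceptDocsOrAll(BitSet liveDocs, int maxDoc) {
        if (liveDocs != null) {
            return liveDocs;
        }
        BitSet all = new BitSet(maxDoc);
        all.set(0, maxDoc); // sets bits 0 .. maxDoc - 1
        return all;
    }

    // DocIdSet convention: null means "no documents matched".
    public static BitSet matchesOrNone(BitSet matches, int maxDoc) {
        return matches != null ? matches : new BitSet(maxDoc);
    }

    public static void main(String[] args) {
        int maxDoc = 4;
        System.out.println(acceptDocsOrAll(null, maxDoc).cardinality()); // prints 4
        System.out.println(matchesOrNone(null, maxDoc).cardinality());   // prints 0
    }
}
```

The same null value therefore expands to opposite sets depending on which side of the call it appears on.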

Okay, as described above, I would now pass either the result of
getLiveDocs() or a Bits.MatchAllBits instance as the acceptDocs argument to
getDocIdSet():

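Map<Term, TermContext> termContexts = new HashMap<>();
AtomicReaderContext atomic = ...
ChainedFilter filter = ...

Bits allDocs = atomic.reader().getLiveDocs();
if (allDocs == null) {
  // no deleted documents: accept every document in this segment
  allDocs = new Bits.MatchAllBits(atomic.reader().maxDoc());
}
DocIdSet docIdSet = filter.getDocIdSet(atomic, allDocs);
if (docIdSet == null) {
  // no documents in this segment match the filter
  continue; // skip this iteration
}
Bits bits = docIdSet.bits();
if (bits == null) {
  // this DocIdSet does not support bits(); this does NOT mean "no matches"
  throw new IllegalStateException("DocIdSet does not support bits()");
}
// the filter result plays the role of the live docs for the span query
Spans spans = sq.getSpans(atomic, bits, termContexts);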


> Finally: The trick here is to make Spans think that there are more deleted docs than
> AtomicReader returns as deleted docs (if you would directly pass getLiveDocs() to getSpans()).
> The filter is applied to the deleted docs BitSet.
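
The trick can be sketched without Lucene as a plain intersection of the live docs with the filter's bits. This is only an illustration, assuming java.util.BitSet stand-ins; RestrictLiveDocs and restrict are hypothetical names, not Lucene API:

```java
import java.util.BitSet;

public class RestrictLiveDocs {
    // Returns the docs that are both live and accepted by the filter.
    // Every doc the filter rejects looks "deleted" to the consumer.
    public static BitSet restrict(BitSet liveDocs, BitSet filterBits) {
        BitSet restricted = (BitSet) liveDocs.clone(); // leave the input untouched
        restricted.and(filterBits);
        return restricted;
    }

    public static void main(String[] args) {
        BitSet live = new BitSet(5);
        live.set(0, 5);   // docs 0..4 exist
        live.clear(3);    // doc 3 is really deleted

        BitSet filter = new BitSet(5);
        filter.set(1);
        filter.set(3);    // matches the filter, but is deleted
        filter.set(4);

        System.out.println(restrict(live, filter)); // prints {1, 4}
    }
}
```

Docs 0 and 2 are live but rejected by the filter, so a consumer of the restricted set sees four "deleted" docs instead of one.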

Yep, I think I've tried to simulate that now. It is pretty hard to test
this systematically, so please let me know if you see an obvious flaw in
my code. Thanks!
Best,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schnober@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
