lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Mark Harwood (JIRA)" <>
Subject [jira] Commented: (LUCENE-584) Decouple Filter from BitSet
Date Mon, 03 Dec 2007 17:06:43 GMT


Mark Harwood commented on LUCENE-584:

To go back to post #1 on this topic:

   _"Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It
would be desirable to have an alternative BitSet implementation with smaller memory footprint."_

Given the motivation to move to more memory efficient structures  why is the only attempt
at caching dedicated exclusively to caching the very structures we were trying to move away

       _"I deprecated also CachingWrapperFilter and RemoteCachingWrapperFilter and added corresponding
CachingBitSetFilter and RemoteCachingBitSetFilter"_

Does this suggest we are to have type-specific CachingXxxxxFilters and RemoteCachingXxxxxFilters
created for every new filter type? Why not provide a single caching mechanism that works for
all those other, new, more memory-efficient structures? I beleive the reason this hasn't been
done is due to the issue I highlighted earlier - the cachable artefacts (what I chose to call
"DocIdSet" here: [#action_12518642] ) are not modelled in  a way which promotes re-use. That's
why we would end up needing a specialised caching implementations for each type. 

If we are to move forward from the existing Lucene implementation it's important to note the

* Filters currently produce, at great cost, BitSets. Bitsets provide both a cachable data
structure and a thread-safe, reusable  means of iterating across the contents.

* By replacing BitSets with Matchers this proposal has removed an important aspect of the
existing design -  the visibility (and therefore cachability) of these expensive-to-recreate
data structures. Matchers are single-use, non-threadsafe objects and hide the data structure
over which they iterate. With this change if I want to implement a caching mechanism in my
application I need to know the Filter type and what sort of data structure it returns and
get it from it directly:
  if(myFilter instanceof BitSetFilter)    wrap specific data structure using CachingBitSetFilter
  if(myFilter instanceof OpenBitSetFilter)   wrap specific data structure using CachingXxxxxFilter

...looks like an Anti-pattern to me. Worse, this ties the choice of datastructure to the type
of Filter that produces it. Why can't my RangeFilter be free to create a SortedVIntList or
a BitSet depending on the sparseness of matches for a particular set of criteria?

I'm not saying "lets just stick with Bitsets", just consider caching more in the design. Post
[#action_12518642] lays out how this could be modelled with the introduction of DocIdSet and
DocIdSetIterator as separate responsibilities (whereas Matcher currently combines them both).


> Decouple Filter from BitSet
> ---------------------------
>                 Key: LUCENE-584
>                 URL:
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.0.1
>            Reporter: Peter Schäfer
>            Assignee: Michael Busch
>            Priority: Minor
>         Attachments: bench-diff.txt, bench-diff.txt, lucene-584-take2.patch, lucene-584.patch,
Matcher-20070905-2default.patch, Matcher-20070905-3core.patch, Matcher-20071122-1ground.patch,
> {code}
> package;
> public abstract class Filter implements 
> {
>   public abstract AbstractBitSet bits(IndexReader reader) throws IOException;
> }
> public interface AbstractBitSet 
> {
>   public boolean get(int index);
> }
> {code}
> It would be useful if the method =Filter.bits()= returned an abstract interface, instead
of =java.util.BitSet=.
> Use case: there is a very large index, and, depending on the user's privileges, only
a small portion of the index is actually visible.
> Sparsely populated =java.util.BitSet=s are not efficient and waste lots of memory. It
would be desirable to have an alternative BitSet implementation with smaller memory footprint.
> Though it _is_ possibly to derive classes from =java.util.BitSet=, it was obviously not
designed for that purpose.
> That's why I propose to use an interface instead. The default implementation could still
delegate to =java.util.BitSet=.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message