lucene-dev mailing list archives

From "Mikhail Khludnev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (LUCENE-5052) bitset codec for off heap filters
Date Wed, 09 Apr 2014 20:37:20 GMT

    [ https://issues.apache.org/jira/browse/LUCENE-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13964636#comment-13964636 ]

Mikhail Khludnev commented on LUCENE-5052:
------------------------------------------

bq. I think the patch looks like a good start!  Seems like we need to support a sparse bitset
form to make it more general purpose?
Agreed. I wonder what the shortest path is. I see the WAH8 DocIdSet implementation. Is it a good
idea to take it and move it onto a ByteBuffer? Or should we just build it on heap as-is and
persist it to disk? Is it worth looking at the Elias-Fano DocIdSet, which is not committed AFAIK?
Or should we research other formats such as RLE?
bq.  Do all lucene tests pass if you run with -Dtests.codec=BitSetCodec?
There is a codec test for docs_only, which passes. How can the other tests pass if the codec
doesn't support freqs and positions? Or do we need to go through all the failures and triage them?
bq. Why did you use the older BlockTerms dict instead of BlockTree?
Let's check whether we can move to BlockTree.
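
To make the "move it to ByteBuffer" option concrete, here is a minimal sketch of a plain (uncompressed) bitset read through java.nio.ByteBuffer. It is not taken from the attached patch: the class name ByteBufferBits and its methods are hypothetical, and it uses a direct buffer where a real codec would presumably use a memory-mapped one. A sparse form (WAH8, Elias-Fano, RLE) would replace the one-bit-per-doc layout assumed here.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not part of Lucene or of the LUCENE-5052 patch.
final class ByteBufferBits {
    private final ByteBuffer buf; // may be direct or memory-mapped (off heap)
    private final int numBits;    // one bit per document, i.e. the segment's doc count

    ByteBufferBits(ByteBuffer buf, int numBits) {
        this.buf = buf;
        this.numBits = numBits;
    }

    // Absolute reads do not move the buffer position, so lookups are stateless.
    boolean get(int docId) {
        return (buf.get(docId >>> 3) & (1 << (docId & 7))) != 0;
    }

    void set(int docId) {
        int i = docId >>> 3;
        buf.put(i, (byte) (buf.get(i) | (1 << (docId & 7))));
    }

    // Linear scan for the next set bit at or after 'from'; returns -1 if none.
    int nextSetBit(int from) {
        for (int d = from; d < numBits; d++) {
            if (get(d)) {
                return d;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        ByteBufferBits bits = new ByteBufferBits(ByteBuffer.allocateDirect(8), 64);
        bits.set(3);
        bits.set(17);
        bits.set(63);
        System.out.println(bits.get(17));        // true
        System.out.println(bits.nextSetBit(4));  // 17
        System.out.println(bits.nextSetBit(18)); // 63
    }
}
```

The bit-by-bit scan in nextSetBit is kept simple on purpose; a real iterator would read whole words and skip zero runs.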


> bitset codec for off heap filters
> ---------------------------------
>
>                 Key: LUCENE-5052
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5052
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: core/codecs
>            Reporter: Mikhail Khludnev
>              Labels: features
>             Fix For: 5.0
>
>         Attachments: LUCENE-5052-1.patch, LUCENE-5052.patch, bitsetcodec.zip, bitsetcodec.zip
>
>
> Colleagues,
> When we filter, we don’t care about any of the scoring factors (norms, positions, tf), but
> filtering should be fast. The obvious way to handle this is to decode the postings list and
> cache it on heap (CachingWrapperFilter, Solr’s DocSet). Both consuming heap and decoding
> are expensive.
> Let’s write a postings list as a bitset if df is greater than the segment's maxDoc/8
> (what about skip lists? and overall performance?).
> Besides the codec implementation, the trickiest part to me is designing the API for this.
> How can we let the app know that a term query doesn’t need to be cached on heap, but can be
> held as an mmapped bitset?
> WDYT?
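
As a rough illustration of the df threshold proposed in the description, the sketch below compares a term's document frequency against one eighth of the segment's document count. The class and method names are hypothetical and are not from the patch. One plausible reading of the threshold, stated here as an assumption: a bitset always costs about maxDoc/8 bytes, so once df exceeds that, the bitset is no larger than a postings encoding that spends at least one byte per document.

```java
// Hypothetical illustration only; not part of Lucene or of the LUCENE-5052 patch.
final class BitsetThreshold {

    // True when the term is dense enough for the proposed bitset representation.
    static boolean useBitset(int docFreq, int maxDoc) {
        return docFreq > maxDoc / 8;
    }

    // Size in bytes of a one-bit-per-document bitset, rounded up to whole bytes.
    static long bitsetBytes(int maxDoc) {
        return ((long) maxDoc + 7) / 8;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        System.out.println(useBitset(200_000, maxDoc)); // true: 200000 > 125000
        System.out.println(useBitset(50_000, maxDoc));  // false: 50000 <= 125000
        System.out.println(bitsetBytes(maxDoc));        // 125000
    }
}
```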



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

