lucene-java-user mailing list archives

From Adrien Grand <jpou...@gmail.com>
Subject Re: Compression algorithm for posting lists
Date Tue, 29 Mar 2016 08:11:45 GMT
BlockTreeTermsWriter.TermsWriter.finish writes an FST that serves as an
index into the terms dictionary. It is used at search time when seeking
terms in the terms dictionary.
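
As a rough illustration of what "an index into the terms dictionary" means
here, the sketch below maps term prefixes to the file offset of an on-disk
block of terms, and a seek picks the block whose prefix is the longest match
for the target term. This is not Lucene's code: the real index is an FST, and
the class name PrefixBlockIndex, the HashMap, and the offsets are all
assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class PrefixBlockIndex {
    // Hypothetical stand-in for the FST: term prefix -> file pointer of the
    // block that holds the terms sharing that prefix.
    private final Map<String, Long> blockPointers = new HashMap<>();

    public void addBlock(String prefix, long filePointer) {
        blockPointers.put(prefix, filePointer);
    }

    /** Returns the file pointer of the block with the longest matching prefix, or -1. */
    public long seekBlock(String term) {
        // Try the full term first, then ever shorter prefixes down to "".
        for (int len = term.length(); len >= 0; len--) {
            Long fp = blockPointers.get(term.substring(0, len));
            if (fp != null) {
                return fp;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        PrefixBlockIndex index = new PrefixBlockIndex();
        index.addBlock("", 0L);      // root block (made-up offset)
        index.addBlock("lu", 128L);  // made-up offset
        index.addBlock("luc", 512L); // made-up offset
        System.out.println(index.seekBlock("lucene"));
        System.out.println(index.seekBlock("lunar"));
        System.out.println(index.seekBlock("solr"));
    }
}
```

Once the block is located, the reader would still have to scan that block for
the exact term; the index only narrows down where to look.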

On Mon, Mar 28, 2016 at 14:02, Vishwas Jain <vjvishjn@gmail.com> wrote:

> Thanks for the reply and information.
> While going through the code, I had some questions about how the lucene54
> codec writes the posting lists using the lucene50 postings writer.
> What exactly does the finish() method in the
> TermsWriter class of the BlockTreeTermsWriter.java file do? I have come to
> understand that the posting lists (document ID, frequency, etc.) are mainly
> written using the writeBlock method in the ForUtil.java file...
>
> Thanks..
>
> > On Mon, Mar 28, 2016 at 4:21 PM, Greg Bowyer <gbowyer@fastmail.co.uk>
> > wrote:
> >
> >> The posting list is compressed using a specialised technique aimed at
> >> pure numbers. Currently the codec uses a variant of Patched Frame of
> >> Reference coding to perform this compression.
> >>
> >> A good survey of such techniques can be found in the standard IR books
> >> (https://mitpress.mit.edu/books/information-retrieval,
> >> http://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703,
> >> http://nlp.stanford.edu/IR-book/) as well as this paper:
> >> http://eprints.gla.ac.uk/93572/1/93572.pdf.
> >>
> >> Interestingly, there are potentially some wins in finding better integer
> >> codings (and one of my personal projects is aimed at doing exactly
> >> this), but I doubt LZ4 compressing the posting list would help all that
> >> much.
> >>
> >> Hope this helps
> >>
> >> On Mon, Mar 28, 2016, at 10:51 AM, Vishwas Jain wrote:
> >> > Hello,
> >> >
> >> > We are trying to implement better compression techniques in the
> >> > lucene54 codec of Apache Lucene. Currently there is no such compression
> >> > for posting lists in the lucene54 codec, but LZ4 compression is used
> >> > for stored fields. Does anyone know why there is no compression
> >> > technique for posting lists? And which compression techniques would be
> >> > beneficial if implemented?
> >> >
> >> > Thanks
> >>
> >
>
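
The frame-of-reference style of integer coding mentioned in the quoted reply
can be sketched roughly as follows: sorted document IDs are turned into
deltas, and every delta in a block is stored with the same number of bits,
just enough for the largest one. This is a minimal sketch under assumptions,
not Lucene's ForUtil: the class name ForBlockSketch, the bit layout, and the
sample IDs are invented for illustration, and no "patched" exception handling
is included.

```java
public class ForBlockSketch {
    /** Gaps between consecutive sorted doc IDs (the first gap is the first ID). */
    static int[] toDeltas(int[] sortedDocIds) {
        int[] deltas = new int[sortedDocIds.length];
        int prev = 0;
        for (int i = 0; i < sortedDocIds.length; i++) {
            deltas[i] = sortedDocIds[i] - prev;
            prev = sortedDocIds[i];
        }
        return deltas;
    }

    /** Inverse of toDeltas: a running sum restores the doc IDs. */
    static int[] fromDeltas(int[] deltas) {
        int[] docIds = new int[deltas.length];
        int sum = 0;
        for (int i = 0; i < deltas.length; i++) {
            sum += deltas[i];
            docIds[i] = sum;
        }
        return docIds;
    }

    /** Bits needed to represent the largest value in the block (at least 1). */
    static int bitsRequired(int[] values) {
        int or = 0;
        for (int v : values) {
            or |= v;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Packs each value into exactly `bits` bits, low bits first, across 64-bit words. */
    static long[] pack(int[] values, int bits) {
        long[] out = new long[(values.length * bits + 63) / 64];
        for (int i = 0; i < values.length; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = values[i] & 0xFFFFFFFFL;
            out[word] |= v << shift;
            if (shift + bits > 64) {
                // The value straddles two words; spill the high bits over.
                out[word + 1] |= v >>> (64 - shift);
            }
        }
        return out;
    }

    /** Reads `count` values of `bits` bits each back out of the packed words. */
    static int[] unpack(long[] packed, int bits, int count) {
        int[] out = new int[count];
        long mask = (1L << bits) - 1;
        for (int i = 0; i < count; i++) {
            long bitPos = (long) i * bits;
            int word = (int) (bitPos >>> 6);
            int shift = (int) (bitPos & 63);
            long v = packed[word] >>> shift;
            if (shift + bits > 64) {
                v |= packed[word + 1] << (64 - shift);
            }
            out[i] = (int) (v & mask);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] docIds = {3, 7, 8, 15, 16, 40, 41, 100}; // made-up posting list
        int[] deltas = toDeltas(docIds);
        int bits = bitsRequired(deltas);
        long[] packed = pack(deltas, bits);
        int[] restored = fromDeltas(unpack(packed, bits, deltas.length));
        System.out.println("bits per value: " + bits + ", words used: " + packed.length);
        System.out.println(java.util.Arrays.toString(restored));
    }
}
```

Because the gaps are small numbers with a known upper bound per block, a fixed
bit width already removes most of the redundancy, which is one reason a
general-purpose byte-oriented compressor such as LZ4 would likely add little
on top.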
