lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Thang Luong Minh" <>
Subject Advices on a replacement of Lucene gap encoding scheme?
Date Thu, 01 Feb 2007 19:04:49 GMT
Dear all,

I am happy to send my first email to Lucene community as after subscribing
to the mailing list, I haven't actually joined the community, just standing
aside and following many intersting threads.

As part of my school project, I am intending to make some improvements in
Lucene source code, and I need some advices on how significance my
modification work would be. What I am interested so far is the gap encoding
scheme in Lucene which is used in DocumentWriter.writePostings() to record
the gap positions of a term within a document. The writePostings(), in turn,
calls the writeVInt() method to record the gap, which is the byte-aligned
coding scheme.

I'm thinking of  replace the byte-aligned scheme with the "fixed binary"
coding scheme mentioned in the paper "Index compression using Fixed Binary
Codewords" by Vo Ngoc Anh and Alistair Moffat (the abstract can be found
This scheme basically breaks the list of gaps into segments whose gaps (in
one segment) will be coded in a  fixed data width w (bits). The number of
gaps in each segment is recored in a span variable s, and the pair (w,s)
form a selector assigned for that segment. By effectively decompose the
list, reduce the number of selectors into 16 combinations of relative data
size (vs. previous segment) and span, and use greedy algorithm to find
suboptimal solutions, the authors claimed that they could achieve better
compression effectiveness (measured in bits per pointer averaged across the
wholde index), and retrieval time compared to Golomb, interpolative,
byte-aligned, and word aligned code schemes.

What I wonder at this time is that in the case of Lucene, how possible it is
to implement the "fixed binary" scheme that could enhance the performance,
and whether there are other parts which I could also consider replacing the
gap-encoding scheme.

As I've started playing around with Lucene recently, I hope to have your
helps to understand Lucene better  ^_^

Best regards,

Luong Minh Thang

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message