lucene-dev mailing list archives

From Doug Cutting
Subject Re: Dmitry's Term Vector stuff, plus some
Date Tue, 17 Feb 2004 20:40:21 GMT
Grant Ingersoll wrote:
> I agree with your assessment about getting it right the first time.  I can make the changes,
> as I don't think they are that involved and it will benefit me and my employer in the long
> run if the changes are committed since we won't have to reapply the patches every time there
> is a new release.

Great!  Thanks.

> It would really speed things up if you can point me to examples of writing the version
> number (and the logic for ignoring something of the wrong version) and the compressed format.

The new TermInfosWriter code writes FORMAT, the current version number. 
  This is read by SegmentTermEnum.  This is not a great example, since 
the previous file format didn't support a version number.  I added it by 
using negative numbers for the version number so that it can be 
distinguished from any valid value at the start of the old format.  It 
will be easier in your case, since back-compatibility is not yet an issue.
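
A minimal sketch of the negative-version trick, for illustration only. It is not the TermInfosWriter code: the class name VersionedHeader, the header layout, and the use of java.io data streams are all assumptions, as is the premise that a legacy file begins with a non-negative 4-byte count.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of Lucene.
class VersionedHeader {
    // Negative, so it cannot be confused with the non-negative count
    // that (by assumption) starts a legacy file.
    static final int FORMAT = -1;

    // New-style header: version marker first, then the entry count.
    static byte[] writeNew(int count) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(FORMAT);
            out.writeInt(count);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Legacy header: no version marker, the count comes first.
    static byte[] writeLegacy(int count) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(count);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns {format, count}; format 0 stands for the legacy layout.
    static int[] read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            int first = in.readInt();
            if (first >= 0) {
                return new int[] {0, first};   // legacy: first int is the count
            }
            if (first < FORMAT) {              // more negative means newer than we know
                throw new IOException("Unknown format: " + first);
            }
            return new int[] {first, in.readInt()};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout, later versions would count downward (-2, -3, and so on), so a reader rejects anything more negative than the FORMAT it was built with.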

In general, the idea is to store a file format version as the first four 
bytes of each file, e.g., something like:

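// OutputStream and InputStream here are Lucene's own classes
// (org.apache.lucene.store), which provide writeInt() and readInt().

class MyWriter {
   public static final int FORMAT = 1;

   public void write(OutputStream out) throws IOException {
     out.writeInt(FORMAT);            // version goes first
     ... write the rest of the file
   }
}

class MyReader {
   public void read(InputStream in) throws IOException {
     int format = in.readInt();

     if (format > MyWriter.FORMAT) {  // written by a newer version
       throw new IOException("Unknown format: " + format);
     }

     if (format == 0) {
        ... back-compatibility stuff for format 0
     } else {
       ...  stuff for current version
     }
   }
}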

As for prefix compression of strings, check out 
TermInfosWriter#writeTerm() and SegmentTermEnum#readTerm().  Since 
vectors contain lexicographically sorted lists of terms in the same 
field, you can use the same technique.
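
To make the idea concrete, here is a rough sketch of prefix compression over a sorted term list. It is not the code in writeTerm() or readTerm(): the class PrefixCoding and its Entry type are hypothetical, and it keeps the encoded form in memory rather than writing bytes, so that only the technique is shown. Each term is stored as the number of leading characters it shares with the previous term, plus the remaining suffix.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Lucene.
class PrefixCoding {
    // One encoded term: how many leading chars it shares with the
    // previous term, and the characters that follow that prefix.
    static final class Entry {
        final int prefixLength;
        final String suffix;

        Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    // Encodes terms in order; sorted input gives long shared prefixes.
    static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> entries = new ArrayList<>();
        String previous = "";
        for (String term : sortedTerms) {
            int limit = Math.min(previous.length(), term.length());
            int shared = 0;
            while (shared < limit && previous.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            entries.add(new Entry(shared, term.substring(shared)));
            previous = term;
        }
        return entries;
    }

    // Rebuilds each term from the previous term's prefix plus the suffix.
    static List<String> decode(List<Entry> entries) {
        List<String> terms = new ArrayList<>();
        String previous = "";
        for (Entry entry : entries) {
            String term = previous.substring(0, entry.prefixLength) + entry.suffix;
            terms.add(term);
            previous = term;
        }
        return terms;
    }
}
```

For the terms "apple", "applet", "apply", "banana" this yields (0, "apple"), (5, "t"), (4, "y"), (0, "banana"). In a file, each pair would typically be written as a length followed by the suffix characters.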

Hope that helps,

