lucene-dev mailing list archives

From Marvin Humphrey <mar...@rectangular.com>
Subject Re: Question about FieldInfos
Date Mon, 16 Jan 2006 00:43:25 GMT

On Jan 15, 2006, at 3:34 PM, Robert Kirchgessner wrote:

> There was even a patch to that problem:
>
> http://issues.apache.org/jira/browse/LUCENE-211

This is a large and somewhat hard-to-read patch.  Some stuff looks
familiar.  Looks like he's concatenating the field name with the
token text, which is sort of the right idea, though you need to take
some precautions for field names of differing lengths, and I didn't
immediately detect any in the patch.  (KinoSearch uses the field number,
which corresponds to the lexically sorted field name at index-time,
encoded as a big-endian 16-bit int.)
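
To illustrate the concern about field names of differing lengths, here is a minimal Python sketch. It is not taken from the LUCENE-211 patch or from KinoSearch; the helper names (`assign_field_numbers`, `naive_key`, `sort_key`) are hypothetical, and encoding the token text as UTF-8 bytes is an assumption made only so the keys can be compared bytewise.

```python
import struct

def assign_field_numbers(field_names):
    """Number fields by lexical order of their names (hypothetical helper)."""
    return {name: num for num, name in enumerate(sorted(set(field_names)))}

def naive_key(field_name, term_text):
    """Plain concatenation: the boundary between name and text is lost."""
    return (field_name + term_text).encode("utf-8")

def sort_key(field_num, term_text):
    """Fixed-width prefix: 2 bytes of big-endian field number, then the text."""
    # ">H" is an unsigned 16-bit integer, most significant byte first.
    return struct.pack(">H", field_num) + term_text.encode("utf-8")

numbers = assign_field_numbers(["ab", "a"])   # {"a": 0, "ab": 1}

# Field "a" + text "bc" collides with field "ab" + text "c".
collision = naive_key("a", "bc") == naive_key("ab", "c")

# With the fixed-width prefix the two keys stay distinct, and a bytewise
# sort groups all terms of field 0 ahead of all terms of field 1.
keys = sorted([sort_key(numbers["ab"], "c"), sort_key(numbers["a"], "bc")])
```

The byte order matters for the same reason: with a big-endian prefix, bytewise comparison agrees with numeric comparison of the field numbers, whereas a little-endian prefix would sort field 256 ahead of field 1.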

The interesting thing to me is that it doesn't seem to feed an
external sorter.  If I understand the concept correctly, he's feeding
a sortpool for minMergeDocuments documents, creating a small
mini-index (minMergeDocuments in size), then falling back to the primary
merge model.  If that isn't what that patch does, well... that
concept would still work, and it would be nice not to need an
external sorter.
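
The concept described above can be sketched roughly as follows. This is only an illustration of the idea, under the assumption that postings are simple (term, doc_id) pairs; it makes no claim about what the patch actually does, and `SortPool` and `merge_segments` are hypothetical names.

```python
import heapq

class SortPool:
    """Hypothetical in-memory pool holding postings for a bounded number of docs."""

    def __init__(self, min_merge_docs):
        self.min_merge_docs = min_merge_docs
        self.postings = []        # unsorted (term, doc_id) pairs
        self.docs_buffered = 0
        self.segments = []        # each flushed mini-index is a sorted list

    def add_document(self, doc_id, terms):
        for term in terms:
            self.postings.append((term, doc_id))
        self.docs_buffered += 1
        # Once enough documents are buffered, sort in memory and flush.
        if self.docs_buffered >= self.min_merge_docs:
            self.flush()

    def flush(self):
        if self.postings:
            self.segments.append(sorted(self.postings))
        self.postings = []
        self.docs_buffered = 0

def merge_segments(segments):
    """Stand-in for the primary merge model: a k-way merge of sorted runs."""
    return list(heapq.merge(*segments))

pool = SortPool(min_merge_docs=2)
pool.add_document(0, ["b", "a"])
pool.add_document(1, ["a"])       # second doc triggers a flush
pool.add_document(2, ["c"])
pool.flush()                      # flush the partial pool at close
merged = merge_segments(pool.segments)
```

Because each pool is bounded by the document count, the sort always fits in memory, which is why no external sorter is needed in this scheme.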

> Yes, the binary format is fully compatible to that of Lucene, as
> is the read/write/search logic.

So...

    * You use Sun's "Modified UTF-8" (not true UTF-8) to
      encode character data.
    * The VInt counts at the head of strings represent Java
      chars, not Unicode code points or bytes.
    * You've run tests with source material containing
      null bytes, Unicode characters outside the Basic
      Multilingual Plane, and corrupt character data (e.g.,
      broken UTF-8), and you are confident that indexes produced
      by Lucene and PHPLucene from such data are mutually compatible.
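
To make the first two points concrete, here is a small Python sketch of a string written the way those bullets describe: a VInt count of Java chars (UTF-16 code units) followed by Modified UTF-8 bytes. The function names are hypothetical, and the VInt layout (7 data bits per byte, low-order group first, high bit set on every byte but the last) is stated here as an assumption about the format rather than quoted from it.

```python
def write_vint(n):
    """Variable-length int: 7 data bits per byte, low-order group first."""
    if n < 0:
        raise ValueError("VInt must be non-negative")
    out = bytearray()
    while n > 0x7F:
        out.append((n & 0x7F) | 0x80)   # high bit set: more bytes follow
        n >>= 7
    out.append(n)
    return bytes(out)

def utf16_units(text):
    """Split a string into UTF-16 code units, i.e. Java chars."""
    raw = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]

def modified_utf8(units):
    """Encode each UTF-16 code unit separately, as Modified UTF-8 does."""
    out = bytearray()
    for u in units:
        if 0x01 <= u <= 0x7F:
            out.append(u)
        elif u <= 0x7FF:                 # also covers U+0000 -> C0 80
            out.append(0xC0 | (u >> 6))
            out.append(0x80 | (u & 0x3F))
        else:                            # surrogates land here, 3 bytes each
            out.append(0xE0 | (u >> 12))
            out.append(0x80 | ((u >> 6) & 0x3F))
            out.append(0x80 | (u & 0x3F))
    return bytes(out)

def write_string(text):
    units = utf16_units(text)
    return write_vint(len(units)) + modified_utf8(units)

with_null = write_string("a\x00")          # null is two bytes, never 0x00
clef = write_string("\U0001D11E")          # one code point, two Java chars
true_utf8 = "\U0001D11E".encode("utf-8")   # four bytes in real UTF-8
```

A character outside the Basic Multilingual Plane therefore gets a count of 2 and six bytes of payload, where true UTF-8 would use four bytes for one code point. That is the sort of mismatch the third bullet is probing for.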

> By the way, though the project
> emerged as a lucene implementation in PHP I soon switched
> to writing a pure C-library with a binding to PHP. Now its
> mostly a C-project.

KinoSearch has taken a similar path of late, adding more and more XS  
(Perl's C API).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



