lucene-dev mailing list archives

From Nicolas Lalevée
Subject Re: Flexible index format / Payloads Cont'd
Date Sat, 05 Aug 2006 07:54:17 GMT
On Thursday 3 August 2006 21:49, Marvin Humphrey wrote:
> On Jul 31, 2006, at 8:25 AM, Nicolas Lalevée wrote:
> > That looks good, but there is one restriction: it has to be per
> > document.
> Yes, what I laid out was per-document - for each document, the fdx
> file would keep a file pointer and an integer mapping to a codec.
> > In fact, I was thinking about a more generic version that would
> > preserve format compatibility, keeping .fdx as is:
> >
> > FieldData (.fdt) -->  <DocFieldData>SegSize
> > DocFieldData --> FieldCount, <FieldNum, RawData>FieldCount
> >
> > The default FieldsDataWriter would be the current one: it would treat
> > the RawData as Bits, Value, with Value --> String | BinaryValue, ...
> > Then, for my app, I would provide a custom FieldsDataWriter that does
> > exactly what I want.
> OK, that's quite similar, but with the info specifying how to
> deserialize the document stored in fdt rather than fdx.
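
The DocFieldData layout quoted above can be illustrated with a small round trip. This is only a sketch under assumptions: the class name RawDocFieldData is hypothetical, RawData is assumed to be stored as a length followed by opaque bytes, and fixed-width ints stand in for whatever integer encoding the real .fdt file would use. It is not Lucene's FieldsWriter/FieldsReader code.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the proposed layout, not Lucene's actual classes.
// DocFieldData --> FieldCount, <FieldNum, RawData> repeated FieldCount times
final class RawDocFieldData {

    // Serializes one document's stored fields (field number -> opaque bytes).
    static byte[] write(Map<Integer, byte[]> fields) {
        int size = 4; // FieldCount
        for (byte[] raw : fields.values()) {
            size += 4 + 4 + raw.length; // FieldNum + assumed length prefix + RawData
        }
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(fields.size());                 // FieldCount
        for (Map.Entry<Integer, byte[]> e : fields.entrySet()) {
            out.putInt(e.getKey());                // FieldNum
            out.putInt(e.getValue().length);       // RawData length (assumption)
            out.put(e.getValue());                 // RawData, uninterpreted
        }
        return out.array();
    }

    // Reads the fields back without interpreting RawData; a default or
    // custom reader would decide what the bytes mean.
    static Map<Integer, byte[]> read(byte[] docFieldData) {
        ByteBuffer in = ByteBuffer.wrap(docFieldData);
        int fieldCount = in.getInt();
        Map<Integer, byte[]> fields = new LinkedHashMap<>();
        for (int i = 0; i < fieldCount; i++) {
            int fieldNum = in.getInt();
            byte[] raw = new byte[in.getInt()];
            in.get(raw);
            fields.put(fieldNum, raw);
        }
        return fields;
    }
}
```

The generic layer only knows field numbers and byte lengths; everything about flags, strings, or binary values lives in the code that interprets RawData.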

In fact, you're not obliged to store any "codec" information. If your app's
data always has the same form, then you just store the data and no codec info.
For my use case, I would skip the bits about compressed/binary, and I will
only put what I want : a pointer to a type, a pointer to a lang, and the 
One important note about this design is that the index would only be read by
my custom reader and written by my custom writer.

> However, I 
> don't think what you're describing makes the field storage in Lucene
> arbitrarily extensible, since you're just going to override
> FieldsWriter/FieldsReader rather than modify them so that they can
> use arbitrary codecs.

If you override FieldsWriter/FieldsReader, then you can plug in whatever
writing/reading code you want, which amounts to implementing an arbitrary codec.
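
As a rough illustration of the override approach, the sketch below uses hypothetical stand-in classes (SketchFieldsWriter, SketchFieldsReader and the Typed* subclasses), not Lucene's real FieldsWriter/FieldsReader. The default pair writes a flag byte followed by the value; the app-specific pair drops the flag byte and writes a type id and a language id instead. Using short ids and a UTF-8 value are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-ins for the classes discussed; not Lucene's real API.
class SketchFieldsWriter {
    // Default layout: one flag byte (compressed/binary bits), then the UTF-8 value.
    byte[] writeField(String value) {
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(1 + text.length).put((byte) 0).put(text).array();
    }
}

class SketchFieldsReader {
    // Skips the flag byte and decodes the rest as UTF-8.
    String readField(byte[] raw) {
        return new String(raw, 1, raw.length - 1, StandardCharsets.UTF_8);
    }
}

// App-specific layout: no flag byte, just a type id, a language id, and the value.
class TypedFieldsWriter extends SketchFieldsWriter {
    private final short typeId;
    private final short langId;

    TypedFieldsWriter(short typeId, short langId) {
        this.typeId = typeId;
        this.langId = langId;
    }

    @Override
    byte[] writeField(String value) {
        byte[] text = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + text.length)
                .putShort(typeId).putShort(langId).put(text).array();
    }
}

// Only this reader understands what TypedFieldsWriter wrote.
class TypedFieldsReader extends SketchFieldsReader {
    @Override
    String readField(byte[] raw) {
        ByteBuffer in = ByteBuffer.wrap(raw);
        short typeId = in.getShort();
        short langId = in.getShort();
        String text = new String(raw, 4, raw.length - 4, StandardCharsets.UTF_8);
        return "type=" + typeId + " lang=" + langId + " value=" + text;
    }
}
```

The trade-off is visible here: nothing in the stored bytes says which layout was used, so the writer and reader must always be swapped together.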

> I think what I want to do is turn Lucene into an Object-Oriented
> Database, or at least have Lucene adopt some characteristics of an
> ODBMS.  However, I haven't used a real ODBMS and I'm not up on the
> theory, so I can't say for sure.  I've been doing a little reading
> here and there on object databases, but I've been extraordinarily
> busy the last few weeks and haven't been able to study it in depth.
> The main point is this:
> Lucene users have diverse needs for what gets stored in the document/
> field storage.  We've been meeting those needs by assigning more and
> more bit flags.  That can't continue ad infinitum.  However, we
> *can* meet everyone's needs by applying a variant of the "Replace
> Conditionals With Polymorphism" refactoring technique...
> (Link to
> Think of those bit flags as an if-else chain.  Instead of all those
> conditionals describing all the attributes of the Lucene Document you
> want to store at that file pointer, we allow you to put whatever kind
> of serialized object you desire there.  Maybe it's a Lucene
> Document.  Maybe it's a FrenchDocument.  Maybe it's a
> RussianDocument.  Maybe it's a wrapped-up jpg.  You choose.
> Instead of continually adding to the complexity of the
> deserialization algorithm, we make that deserialization algorithm
> user-definable.

In fact, this is exactly my point. :-)
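
The "replace conditionals with polymorphism" idea from the quoted text, combined with the integer-to-codec mapping mentioned earlier in the thread, might look roughly like this. All names (StoredDocCodec, Utf8TextCodec, RawBytesCodec, CodecRegistry) are hypothetical and chosen for the example; nothing here is an existing Lucene interface.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface: each codec knows how to (de)serialize one kind of stored object.
interface StoredDocCodec {
    byte[] serialize(Object doc);
    Object deserialize(byte[] raw);
}

// Stores plain text as UTF-8.
final class Utf8TextCodec implements StoredDocCodec {
    public byte[] serialize(Object doc) {
        return ((String) doc).getBytes(StandardCharsets.UTF_8);
    }
    public Object deserialize(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }
}

// Stores opaque bytes untouched, e.g. a wrapped-up image.
final class RawBytesCodec implements StoredDocCodec {
    public byte[] serialize(Object doc) {
        return ((byte[]) doc).clone();
    }
    public Object deserialize(byte[] raw) {
        return raw.clone();
    }
}

// Replaces the if-else chain over bit flags with a lookup by codec id.
final class CodecRegistry {
    private final Map<Integer, StoredDocCodec> codecs = new HashMap<>();

    void register(int codecId, StoredDocCodec codec) {
        codecs.put(codecId, codec);
    }

    // The codec id is assumed to be stored next to the file pointer.
    Object load(int codecId, byte[] raw) {
        StoredDocCodec codec = codecs.get(codecId);
        if (codec == null) {
            throw new IllegalArgumentException("Unknown codec id: " + codecId);
        }
        return codec.deserialize(raw);
    }
}
```

Adding a new kind of stored object then means registering one more codec instead of claiming another bit flag in the shared format.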

If people think it is interesting, I can try to write a prototype.

