lucene-java-user mailing list archives

From Erik Fäßler <>
Subject Re: Serialization of Lucene Document objects
Date Tue, 22 Feb 2011 14:53:36 GMT
  Hi Simon,

thanks for your answer. My comments below:
> so you mean you would want to do that analysis on the client side and
> only shoot the already tokenized values to the server?
> What exactly is too slow? Can you provide more info what the problem is?
> After all I think you should ask on the solr-user list instead.
The point is that I'm using a fairly sophisticated NLP pipeline whose 
output I'd like to index. I have a component which maps this data 
structure (actually a UIMA CAS object) to Lucene documents. A lot is 
done with the input data, including some quite custom adaptations of 
position increments and aligning several other TokenStreams, again in 
terms of position increments and offsets.
I cannot do such things with the native Solr XML format because I need 
several fields with the same name but different indexing / storing 
options. This is because I enrich my documents' texts with metadata 
extracted by my pipeline, so a field gets many more terms than could 
have been extracted from the text by Lucene/Solr analysis alone. Solr 
approximates this capability with multi-valued fields, but that does not 
work the same way.

I measured the timings for batches of 1000 documents sent to Solr. I am 
sending the whole UIMA CAS in serialized form, which is a quite verbose 
XMI format.
Processing 1000 documents in Solr then takes
* approx. 11 sec for deserialization
* approx. 4 sec for mapping to a document
* less than 1 sec for writing the documents to the index.

So most of the time is lost to deserialization. By sending the Lucene 
documents directly I hope to reduce this overhead greatly, as I would 
not be sending the verbose raw data but an already condensed form.
Second, the mapping still takes some time, and that is work which does 
not necessarily have to be done on the server side. I can scale the 
clients arbitrarily, so they should do most of the work.

This is why I'd like to build the Lucene documents on the client side 
and just send them to the server. But now I wonder if this is possible at 
all after the serialization of Lucene documents failed...
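
To make "an already condensed form" concrete, here is a minimal sketch of 
one possible approach: ship plain serializable token data instead of 
Lucene's own classes. It assumes standard Java serialization only. The 
names TokenTransfer, Tok and FieldData are hypothetical and are not part 
of Lucene or Solr; turning the received data back into Lucene fields on 
the server is not shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical transfer format; none of these classes exist in Lucene or Solr.
public class TokenTransfer {

    /** One pre-analyzed token: term text plus its position data. */
    static final class Tok implements Serializable {
        private static final long serialVersionUID = 1L;
        final String term;
        final int positionIncrement; // 0 stacks the token on the previous one
        final int startOffset;
        final int endOffset;

        Tok(String term, int positionIncrement, int startOffset, int endOffset) {
            this.term = term;
            this.positionIncrement = positionIncrement;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** One field instance; several may share a name but differ in options. */
    static final class FieldData implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final boolean stored;
        final ArrayList<Tok> tokens;

        FieldData(String name, boolean stored, List<Tok> tokens) {
            this.name = name;
            this.stored = stored;
            this.tokens = new ArrayList<>(tokens); // defensive, serializable copy
        }
    }

    /** Client side: serialize one document (a list of fields) to bytes. */
    static byte[] toBytes(List<FieldData> doc) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(new ArrayList<>(doc));
        }
        return buf.toByteArray();
    }

    /** Server side: read the document back. */
    @SuppressWarnings("unchecked")
    static List<FieldData> fromBytes(byte[] bytes)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (List<FieldData>) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        // A text token with an annotation stacked at the same position.
        List<Tok> tokens = Arrays.asList(
            new Tok("protein", 1, 0, 7),
            new Tok("ANNOTATION", 0, 0, 7));
        List<FieldData> doc = Arrays.asList(new FieldData("text", false, tokens));

        List<FieldData> received = fromBytes(toBytes(doc));
        System.out.println(received.get(0).tokens.get(1).term);
    }
}
```

Whether this would actually be faster than the XMI route would have to be 
measured; the sketch only shows that the position data survives the 
round trip.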

Sorry for the long read and thanks for your help :)

> Simon
>> Thanks for any hints!
>> Regards,
>>     Erik
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail:
>> For additional commands, e-mail:
