lucene-general mailing list archives

From Ted Dunning <>
Subject Re: Indexing large amount of data
Date Mon, 12 Jul 2010 20:37:17 GMT
On Mon, Jul 12, 2010 at 11:14 AM, sarfaraz masood <> wrote:

> 1) Reuse field & document objects to reduce the GC overhead using the
> field.setValue() method.. By doing this, instead of speeding up, the
> indexing speed reduced drastically. i know this is unusual but thats what
> happened.

GC overhead is much, much less on recent JVMs such as the one you are using.  It
still pays very large benefits to avoid *copying*, but it rarely pays to
avoid allocating.

You should look at the new TokenStream API.

> 2) Tuning parameters by  setMergeFactor(), setMaxBufferedDocs().
> now the default value for both is 10.. i increased the value to 1000.. by
> doing so the no of .CSF file in the index folder increased many folds.. and
> i got : Too Many Files Open.

Have you raised the open-file limit (ulimit -n on Unix-like systems) to the
maximum possible?  It is common for the default limit to be unreasonably small.
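
To see how close the indexing JVM is to that limit, something like the
following might help. It is only a sketch: the class name FdLimitCheck is made
up for illustration, and it relies on com.sun.management.UnixOperatingSystemMXBean,
which is available on Unix-like HotSpot/OpenJDK JVMs but not everywhere, so the
code returns null when the bean is missing.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdLimitCheck {
    /**
     * Returns {open, max} file descriptor counts for this JVM process,
     * or null when the platform bean does not expose them.
     */
    static long[] fdUsage() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof com.sun.management.UnixOperatingSystemMXBean) {
            com.sun.management.UnixOperatingSystemMXBean unix =
                (com.sun.management.UnixOperatingSystemMXBean) os;
            return new long[] {
                unix.getOpenFileDescriptorCount(),
                unix.getMaxFileDescriptorCount()
            };
        }
        return null; // assumption: non-Unix or non-HotSpot JVM
    }

    public static void main(String[] args) {
        long[] usage = fdUsage();
        if (usage == null) {
            System.out.println("descriptor counts not available on this JVM");
        } else {
            System.out.println("open=" + usage[0] + " max=" + usage[1]);
        }
    }
}
```

If the reported maximum is small, raising it in the shell that launches the
indexer is usually the first thing to try, before lowering the merge factor.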

> so where am i going wrong ?? how to overcome these to speed up
> my indexing process..

Another thing you might investigate is indexing on multiple machines in
anticipation of doing sharded search using Solr Cloud or Katta.  That will
have the largest impact on total index time of any change that you can make
relatively easily.
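
As a rough illustration of splitting the work, each document can be assigned
to a shard by hashing a stable key, so that every machine builds its own index
independently. This is only a sketch under assumptions: ShardRouter and
shardFor are hypothetical names, the document key is assumed to be a string
id, and this is not how Solr Cloud or Katta route documents internally.

```java
public class ShardRouter {
    /**
     * Maps a document id to a shard number in [0, numShards).
     * Math.floorMod keeps the result non-negative even when hashCode() is negative.
     */
    static int shardFor(String docId, int numShards) {
        if (numShards <= 0) {
            throw new IllegalArgumentException("numShards must be positive");
        }
        return Math.floorMod(docId.hashCode(), numShards);
    }

    public static void main(String[] args) {
        int numShards = 4; // assumption: four indexing machines
        String[] ids = { "doc-1", "doc-2", "doc-3", "doc-4", "doc-5" };
        for (String id : ids) {
            // Each machine would index only the documents routed to its shard.
            System.out.println(id + " -> shard " + shardFor(id, numShards));
        }
    }
}
```

The same id always lands on the same shard, which matters if documents are
later updated or deleted by id.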
