lucene-solr-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Blargy <>
Subject Re: Importing large datasets
Date Wed, 02 Jun 2010 15:44:28 GMT

As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).  

Our master solr machine is running 64-bit RHEL 5.4 on dedicated machine with
4 cores and 16G ram so I think we are good on the hardware. Our DB is MySQL
version 5.0.67 (exact stats i don't know of the top of my head)

When you say "quite large", what do you mean?  Are we talking books here or
maybe a couple pages of text or just a couple KB of data?

Our item descriptions are very similar to an ebay listing and can include
HTML. We are talking about a couple of pages of text.

How long does it take you to get that data out (and, from the sounds of it,
merge it with your item) w/o going to Solr? 

I'll have to get back to you on that one.

DataImportHandler now supports multiple threads. 

When you say "now", what do you mean? I am running version 1.4.

The absolute fastest way that I know of to index is via multiple threads
sending batches of documents at a time (at least 100)

 Is there a wiki explaining how this multiple thread process works? Which
batch size would work best? I am currently using a -1 batch size. 

You may want to write your own multithreaded client to index. 

This sounds like a viable option. Can you point me in the right direction on
where to begin (what classes to look at, prior examples, etc)?

Here is my field type I am using for the item description. Maybe its not the

  <fieldType name="text" class="solr.TextField" omitNorms="false">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>

Here is an overview of my data-config.xml. Thoughts?

 <entity name="item" 
            query="select * from items">
    <entity name="item_description" 
                query="select description from item_descriptions where

I appreciate the help.
View this message in context:
Sent from the Solr - User mailing list archive at

View raw message