lucene-solr-user mailing list archives

From Andrzej Bialecki
Subject Re: Importing large datasets
Date Wed, 02 Jun 2010 10:53:54 GMT
On 2010-06-02 12:42, Grant Ingersoll wrote:
> On Jun 1, 2010, at 9:54 PM, Blargy wrote:
>> We have around 5 million items in our index and each item has a description
>> located on a separate physical database. These item descriptions vary in
>> size and for the most part are quite large. Currently we are only indexing
>> items and not their corresponding description and a full import takes around
>> 4 hours. Ideally we want to index both our items and their descriptions but
>> after some quick profiling I determined that a full import would take in
>> excess of 24 hours. 
>> - How would I profile the indexing process to determine if the bottleneck is
>> Solr or our Database.
> As a data point, I routinely see clients index 5M items on normal
> hardware in approx. 1 hour (give or take 30 minutes).  
> When you say "quite large", what do you mean?  Are we talking books here or maybe a couple
> pages of text or just a couple KB of data?
> How long does it take you to get that data out (and, from the sounds of it, merge it
> with your item) w/o going to Solr?
>> - In either case, how would one speed up this process? Is there a way to run
>> parallel import processes and then merge them together at the end? Possibly
>> use some sort of distributed computing?
> DataImportHandler now supports multiple threads.  The absolute fastest way that I know
> of to index is via multiple threads sending batches of documents at a time (at least 100).
> Often, from DBs one can split up the table via SQL statements that can then be fetched separately.
> You may want to write your own multithreaded client to index.
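
The approach Grant describes (split the table into ranges, fetch each range separately, and send batches of at least 100 documents from several threads) could be sketched roughly as below. This is only an illustration, not code from the thread: `id_ranges`, `batches`, `index_range`, `parallel_index`, and `post_to_solr` are hypothetical names, the update URL is a placeholder, and it assumes a Solr version that accepts a JSON array of documents on its update handler. The `fetch_docs` callable stands in for whatever SQL query returns the items (with descriptions merged) for one ID range.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100  # the message suggests at least 100 documents per request


def id_ranges(min_id, max_id, parts):
    """Split [min_id, max_id] into at most `parts` contiguous, non-overlapping ranges."""
    total = max_id - min_id + 1
    step = -(-total // parts)  # ceiling division
    return [(lo, min(lo + step - 1, max_id))
            for lo in range(min_id, max_id + 1, step)]


def batches(docs, size=BATCH_SIZE):
    """Yield lists of at most `size` documents from any iterable."""
    batch = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def post_to_solr(batch, url="http://localhost:8983/solr/mycore/update"):
    """Assumed sender: POST one batch as a JSON array to a placeholder update URL."""
    req = urllib.request.Request(
        url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def index_range(id_range, fetch_docs, send_batch):
    """Fetch one ID range and send it in batches; returns the number of documents sent."""
    lo, hi = id_range
    sent = 0
    for batch in batches(fetch_docs(lo, hi)):
        send_batch(batch)
        sent += len(batch)
    return sent


def parallel_index(min_id, max_id, fetch_docs, send_batch, threads=4):
    """Index every ID range on its own worker thread; returns the total documents sent."""
    ranges = id_ranges(min_id, max_id, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(lambda r: index_range(r, fetch_docs, send_batch), ranges))
```

Passing a `send_batch` that discards its input instead of `post_to_solr` turns the same loop into a timing run for the database side alone, which is one way to check whether the database or Solr is the bottleneck.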

SOLR-1301 is also an option if you are familiar with Hadoop ...

Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration  Contact: info at sigram dot com
