lucene-solr-user mailing list archives

From "Jack Krupansky"
Subject Re: external indexer for Solr Cloud
Date Mon, 01 Sep 2014 16:05:06 GMT
Packaging SolrCell in the same manner, with parallel threads and the ability
to talk to multiple SolrCloud servers in parallel, would have many of the
same benefits.

And maybe there could be a more generic Java framework for indexing that
"external indexers" in general could use.

-- Jack Krupansky

-----Original Message----- 
From: Shawn Heisey
Sent: Monday, September 1, 2014 11:42 AM
Subject: Re: external indexer for Solr Cloud

On 9/1/2014 7:19 AM, Jack Krupansky wrote:
> It would be great to have a "standalone DIH" that runs as a separate
> server and then sends standard Solr update requests to a Solr cluster.

This has been discussed, and I thought we had an issue in Jira, but I
can't find it.

A completely standalone DIH app would be REALLY nice.  I already know
that the JDBC ResultSet is not the bottleneck for indexing, at least for
me.  I once built a simple single-threaded SolrJ application that pulls
data from JDBC and indexes it in Solr.  It works in batches, typically
500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
line (so input object manipulation, casting, and building of the
SolrInputDocument objects is still happening), it can read and
manipulate our entire database (99.8 million documents) in about 20
minutes, but if I leave that in, it takes many hours.

The bottleneck is that each DIH has only a single thread indexing to
Solr.  I've theorized that it should be *relatively* easy for me to
write an application that pulls records off the JDBC ResultSet with
multiple threads (say 10-20), has each thread figure out which shard
its document lands on, and sends it there with SolrJ.  It might even be
possible for the threads to collect several documents for each shard
before indexing them in the same request.
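
A minimal sketch of that idea, using only the JDK so it runs on its own.
Everything in it is an assumption made for illustration: the class name
ShardedIndexerSketch, the ShardSender interface (a stand-in for whatever
would send a batch to a shard, such as a SolrJ add call), the "id" field
used for routing, and the hash in shardFor, which is not the hash SolrCloud
itself uses.  One thread reads rows and feeds a bounded queue, and each
worker keeps its own per-shard buffers so the workers never share mutable
state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: every name here is illustrative, none is a Solr or SolrJ API.
class ShardedIndexerSketch {

    /** Stand-in for the code that ships one batch to one shard. */
    interface ShardSender {
        void send(int shard, List<Map<String, Object>> batch) throws Exception;
    }

    // Unique sentinel, compared by identity, that tells a worker to flush and stop.
    private static final Map<String, Object> POISON = new HashMap<>();

    // Assumption: plain hashCode modulo shard count. SolrCloud's own router uses a
    // different hash and per-shard hash ranges, which a real client would have to match.
    static int shardFor(String id, int numShards) {
        return Math.floorMod(id.hashCode(), numShards);
    }

    static void index(Iterator<Map<String, Object>> rows, int numShards, int numThreads,
                      int batchSize, ShardSender sender) throws Exception {
        BlockingQueue<Map<String, Object>> queue =
                new ArrayBlockingQueue<>(batchSize * numThreads);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        List<Future<?>> workers = new ArrayList<>();
        try {
            for (int t = 0; t < numThreads; t++) {
                workers.add(pool.submit(() -> {
                    // Buffers are private to this worker, so no locking is needed.
                    List<List<Map<String, Object>>> buffers = new ArrayList<>();
                    for (int s = 0; s < numShards; s++) buffers.add(new ArrayList<>());
                    Map<String, Object> doc;
                    while ((doc = queue.take()) != POISON) {
                        int shard = shardFor(String.valueOf(doc.get("id")), numShards);
                        List<Map<String, Object>> buf = buffers.get(shard);
                        buf.add(doc);
                        if (buf.size() >= batchSize) {
                            sender.send(shard, new ArrayList<>(buf));
                            buf.clear();
                        }
                    }
                    // Flush whatever is left in the partially filled buffers.
                    for (int s = 0; s < numShards; s++) {
                        if (!buffers.get(s).isEmpty()) sender.send(s, buffers.get(s));
                    }
                    return null;
                }));
            }
            // The calling thread plays the role of the single ResultSet reader.
            while (rows.hasNext()) put(queue, rows.next(), workers);
            for (int t = 0; t < numThreads; t++) put(queue, POISON, workers);
            for (Future<?> w : workers) w.get();  // rethrows any worker failure
        } finally {
            pool.shutdownNow();
        }
    }

    // Bounded put that surfaces a worker failure instead of blocking forever.
    private static void put(BlockingQueue<Map<String, Object>> queue, Map<String, Object> doc,
                            List<Future<?>> workers) throws Exception {
        while (!queue.offer(doc, 100, TimeUnit.MILLISECONDS)) {
            for (Future<?> w : workers) if (w.isDone()) w.get();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Map<String, Object>> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", "doc-" + i);
            rows.add(doc);
        }
        AtomicInteger sent = new AtomicInteger();
        index(rows.iterator(), 4, 10, 500, (shard, batch) -> sent.addAndGet(batch.size()));
        System.out.println("documents sent: " + sent.get());
    }
}
```

In a real indexer the rows iterator would wrap the JDBC ResultSet and the
sender would call into SolrJ.  The single reader with a bounded queue is one
way to sidestep most of the synchronization questions, at the cost of each
worker holding up to one partial batch per shard in memory.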

As with most multithreaded apps, the hard part is figuring out all the
thread synchronization, making absolutely certain that thread timing is
perfect without unnecessary delays.  If I can figure out a generic
approach (with a few configurable bells and whistles available), it
might be something suitable for inclusion in the project, followed by
improvements from all the smart people in our community.
