lucene-solr-user mailing list archives

From Lee Chunki <lck7...@coupang.com>
Subject Re: external indexer for Solr Cloud
Date Tue, 02 Sep 2014 04:57:44 GMT
Hi,

@Jack
The final goal is to generate the index outside of Solr Cloud, but running DIH externally is not bad either.

@Shawn
It sounds great to build a new application that works with multiple threads and sends documents
to their shards.
Please let me know the logic for deciding which shard a document should go to (i.e. the matching
rule between document and shard).
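
For reference, a minimal sketch of the matching rule in Java, under stated assumptions. With the default compositeId router, SolrCloud hashes the document's unique key with MurmurHash3 (x86, 32-bit, seed 0), and each shard owns a contiguous range of the 32-bit hash space. The sketch assumes plain ids (no "!" routing prefix) and shards that split the hash space evenly; the real ranges are stored in the cluster state in ZooKeeper and should be read from there. The class and method names (ShardRouter, shardFor) are made up for illustration. As far as I know, SolrJ's CloudSolrServer can already do this routing itself in recent 4.x releases, so hand-rolled routing may not be needed.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or SolrJ.
class ShardRouter {

    // MurmurHash3, x86 32-bit variant, over a byte array.
    static int murmur3(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int roundedEnd = data.length & 0xfffffffc; // round down to a multiple of 4

        // Body: mix in one little-endian 4-byte block at a time.
        for (int i = 0; i < roundedEnd; i += 4) {
            int k1 = (data[i] & 0xff)
                   | ((data[i + 1] & 0xff) << 8)
                   | ((data[i + 2] & 0xff) << 16)
                   | (data[i + 3] << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes (the cases fall through on purpose).
        int k1 = 0;
        switch (data.length & 3) {
            case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16;
            case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8;
            case 1: k1 |= (data[roundedEnd] & 0xff);
                    k1 *= c1;
                    k1 = Integer.rotateLeft(k1, 15);
                    k1 *= c2;
                    h1 ^= k1;
        }

        // Finalization: force the bits to avalanche.
        h1 ^= data.length;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    // Assumption: numShards equal ranges covering Integer.MIN_VALUE..Integer.MAX_VALUE,
    // with the last shard absorbing any remainder. Returns a zero-based shard index.
    static int shardFor(String id, int numShards) {
        int hash = murmur3(id.getBytes(StandardCharsets.UTF_8), 0);
        long offset = (long) hash - Integer.MIN_VALUE; // 0 .. 2^32 - 1
        long rangeSize = (1L << 32) / numShards;
        return (int) Math.min(offset / rangeSize, numShards - 1);
    }

    public static void main(String[] args) {
        for (String id : new String[] {"doc-1", "doc-2", "doc-3"}) {
            System.out.println(id + " -> shard index " + shardFor(id, 4));
        }
    }
}
```

The index returned here only reflects the even split assumed above; mapping it to actual shard names and leader URLs has to come from the cluster state.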

Thanks,
Chunki.

On Sep 2, 2014, at 1:15 AM, Siegfried Goeschl <sgoeschl@gmx.at> wrote:

> Hi folks,
> 
> we are using Apache Camel but could use Spring Integration, with the option to upgrade
> to Apache BatchEE or Spring Batch later on - especially Tika document extraction can kill
> your server due to CPU consumption, memory usage and plain memory leaks
> 
> AFAIK Doug Turnbull also improved the Camel Solr Integration
> 
> http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/99739
> 
> Cheers,
> 
> Siegfried Goeschl
> 
> On 01.09.14 18:05, Jack Krupansky wrote:
>> Packaging SolrCell in the same manner, with parallel threads and able to
>> talk to multiple SolrCloud servers in parallel would have a lot of the
>> same benefits as well.
>> 
>> And maybe there could be some more generic Java framework for indexing
>> as well, that "external indexers" in general could use.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Shawn Heisey
>> Sent: Monday, September 1, 2014 11:42 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: external indexer for Solr Cloud
>> 
>> On 9/1/2014 7:19 AM, Jack Krupansky wrote:
>>> It would be great to have a "standalone DIH" that runs as a separate
>>> server and then sends standard Solr update requests to a Solr cluster.
>> 
>> This has been discussed, and I thought we had an issue in Jira, but I
>> can't find it.
>> 
>> A completely standalone DIH app would be REALLY nice.  I already know
>> that the JDBC ResultSet is not the bottleneck for indexing, at least for
>> me.  I once built a simple single-threaded SolrJ application that pulls
>> data from JDBC and indexes it in Solr.  It works in batches, typically
>> 500 or 1000 docs at a time.  When I comment out the "solr.add(docs)"
>> line (so input object manipulation, casting, and building of the
>> SolrInputDocument objects is still happening), it can read and
>> manipulate our entire database (99.8 million documents) in about 20
>> minutes, but if I leave that in, it takes many hours.
>> 
>> The bottleneck is that each DIH has only a single thread indexing to
>> Solr.  I've theorized that it should be *relatively* easy for me to
>> write an application that pulls records off the JDBC ResultSet with
>> multiple threads (say 10-20), have each thread figure out which shard
>> its document lands on, and send it there with SolrJ.  It might even be
>> possible for the threads to collect several documents for each shard
>> before indexing them in the same request.
>> 
>> As with most multithreaded apps, the hard part is figuring out all the
>> thread synchronization, making absolutely certain that thread timing is
>> perfect without unnecessary delays.  If I can figure out a generic
>> approach (with a few configurable bells and whistles available), it
>> might be something suitable for inclusion in the project, followed with
>> improvements by all the smart people in our community.
>> 
>> Thanks,
>> Shawn
> 
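
The idea in Shawn's quoted message of collecting several documents per shard and sending them in one request from multiple threads could be sketched roughly as below. This is only an illustration under assumptions, not his application: documents are plain strings, the Sender interface is a hypothetical stand-in for whatever SolrJ call adds a batch to a given shard, and the shard index is assumed to be computed elsewhere. It also differs from his description in that callers fill the buffers while a thread pool does the sending, rather than each reader thread sending its own documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Solr or SolrJ.
class ShardBatcher {

    // Stand-in for the real indexing call (e.g. a SolrJ add against one shard).
    interface Sender {
        void send(int shard, List<String> docs);
    }

    private final List<List<String>> buffers = new ArrayList<>();
    private final int batchSize;
    private final Sender sender;
    private final ExecutorService pool;

    ShardBatcher(int numShards, int batchSize, int threads, Sender sender) {
        for (int i = 0; i < numShards; i++) {
            buffers.add(new ArrayList<>());
        }
        this.batchSize = batchSize;
        this.sender = sender;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Safe to call from several reader threads: each shard buffer is its own lock.
    void add(int shard, String doc) {
        List<String> buf = buffers.get(shard);
        List<String> full = null;
        synchronized (buf) {
            buf.add(doc);
            if (buf.size() >= batchSize) {
                full = new ArrayList<>(buf); // hand off a copy, then reuse the buffer
                buf.clear();
            }
        }
        if (full != null) {
            submit(shard, full); // sending happens outside the lock
        }
    }

    private void submit(int shard, List<String> batch) {
        // Note: an exception thrown by send() is captured in the discarded Future;
        // a real indexer would need to check for and report failures.
        pool.submit(() -> sender.send(shard, batch));
    }

    // Flushes partially filled buffers and waits for all pending sends.
    void close() {
        for (int s = 0; s < buffers.size(); s++) {
            List<String> buf = buffers.get(s);
            List<String> rest;
            synchronized (buf) {
                rest = new ArrayList<>(buf);
                buf.clear();
            }
            if (!rest.isEmpty()) {
                submit(s, rest);
            }
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The pool's work queue is unbounded in this sketch, so a fast reader can get far ahead of a slow Solr; a bounded queue or some other form of back-pressure would probably be needed for a database of the size mentioned in the thread.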

