lucene-dev mailing list archives

From "Joel Bernstein (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SOLR-6266) Couchbase plug-in for Solr
Date Mon, 22 Sep 2014 14:45:34 GMT

    [ https://issues.apache.org/jira/browse/SOLR-6266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143257#comment-14143257 ]

Joel Bernstein edited comment on SOLR-6266 at 9/22/14 2:45 PM:
---------------------------------------------------------------

From my understanding the CAPIServer is listening on a port. Couchbase can be configured
to replicate a bucket to a specific host and port. So running the CAPIServer on all replicas
just means that there will be many CAPIServers running. The actual replication session will be
between Couchbase and a single CAPIServer. So in a single replication session documents will
flow to one CAPIServer and that CAPIServer will move the documents into the distributed indexing flow.

In this scenario, running a CAPIServer on all replicas really has no downside.

But running the CAPIServer from just the leader has a couple of major downsides:

1) Leaders and replicas will change. Couchbase is pointing directly to an ip:port. If all
of a sudden that node is no longer the leader then replication stops. If the CAPIServer
is running on all replicas then this is not an issue.

2) If we run the CAPIServer everywhere we don't have to manage bringing CAPIServers up and
down as the leader changes. So this removes quite a bit of complexity from the design.

We don't have to worry about duplicate indexing on shards by running CAPIServers on the replicas.
If we inject the documents properly into the SolrCloud indexing flow, then SolrCloud will
ensure that documents get to the right place.
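
To illustrate why the receiving node should not matter, here is a minimal sketch in Java. It is not SolrCloud code: the class and method names are hypothetical, and the plain hashCode-modulo rule is only a stand-in for SolrCloud's real router, which as far as I know hashes the document id onto per-shard hash ranges. The point is just that the target shard depends on the document id alone, not on which CAPIServer received the document.

```java
// Hypothetical illustration only; names are not from Solr or the plugin.
final class ShardRoutingSketch {

    // Stand-in routing rule (assumption): the shard is a pure function of the id.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // The receiving node is deliberately ignored when choosing the shard.
    static int route(String receivingNode, String docId, int numShards) {
        return shardFor(docId, numShards);
    }

    public static void main(String[] args) {
        int numShards = 4;
        for (String node : new String[] {"node-a:8983", "node-b:8983"}) {
            System.out.println(node + " forwards doc-1 to shard "
                    + route(node, "doc-1", numShards));
        }
    }
}
```

Both nodes pick the same shard for the same id, so a document arriving at any replica still ends up in one place.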

What we do have to consider very carefully though is whether we need a CAPIServer running
per Collection or per Solr node, because this affects the entire design.

My thinking is that we should have a single CAPIServer per Solr node to service all collections.
I'm assuming that the CAPIServer has thread overhead that we don't want for each collection.


But if we decide to go this route then we will need to route documents to the correct collection
based on the bucket name. We'll also need to figure out how to place the CAPIServer so there
is only one per node.
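
A rough sketch of the bucket-name routing, in Java. Everything here is an assumption for illustration: BucketRouter, collectionFor and the bucket and collection names are made up, and the actual CAPIServer callback API is not shown. It only demonstrates one per-node lookup table shared by all collections.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper; not part of Solr, Couchbase, or the attached plugin.
final class BucketRouter {
    private final Map<String, String> bucketToCollection;

    BucketRouter(Map<String, String> bucketToCollection) {
        // Defensive copy so later changes to the caller's map have no effect.
        this.bucketToCollection = new HashMap<>(bucketToCollection);
    }

    // Returns the collection mapped to this bucket, or empty if unmapped.
    Optional<String> collectionFor(String bucket) {
        return Optional.ofNullable(bucketToCollection.get(bucket));
    }

    public static void main(String[] args) {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("beer-sample", "beers");      // example names only
        mapping.put("gamesim-sample", "games");
        BucketRouter router = new BucketRouter(mapping);

        System.out.println(router.collectionFor("beer-sample").orElse("<unmapped>"));
        System.out.println(router.collectionFor("unknown").orElse("<unmapped>"));
    }
}
```

Returning an empty Optional for an unmapped bucket leaves it to the caller to reject or log the document rather than guessing a collection.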

was (Author: joel.bernstein):
From my understanding the CAPIServer is listening on a port. Couchbase can be configured
to replicate a bucket to a specific host and port. So running the CAPIServer on all replicas
just means that there will be many CAPIServers running. The actual replication session will be
between Couchbase and a single CAPIServer. So in a single replication session documents will
flow to one CAPIServer, and that CAPIServer and its Solr instance will move the documents into
the distributed indexing flow.

In this scenario, running a CAPIServer on all replicas really has no downside.

But running the CAPIServer from just the leader has a couple of major downsides:

1) Leaders and replicas will change. Couchbase is pointing directly to an ip:port. If all
of a sudden that node is no longer the leader then replication stops. If the CAPIServer
is running on all replicas then this is not an issue.

2) If we run the CAPIServer everywhere we don't have to manage bringing CAPIServers up and
down as the leader changes. So this removes quite a bit of complexity from the design.

We don't have to worry about duplicate indexing on shards by running CAPIServers on the replicas.
If we inject the documents properly into the SolrCloud indexing flow, then SolrCloud will
ensure that documents get to the right place.

What we do have to consider very carefully though is whether we need a CAPIServer running
per Collection or per Solr node, because this affects the entire design.

My thinking is that we should have a single CAPIServer per Solr node to service all collections.
I'm assuming that the CAPIServer has thread overhead that we don't want for each collection.

But if we decide to go this route then we will need to route documents to the correct collection
based on the bucket name. We'll also need to figure out how to place the CAPIServer so there
is only one per node.

> Couchbase plug-in for Solr
> --------------------------
>
>                 Key: SOLR-6266
>                 URL: https://issues.apache.org/jira/browse/SOLR-6266
>             Project: Solr
>          Issue Type: New Feature
>            Reporter: Varun
>            Assignee: Joel Bernstein
>         Attachments: solr-couchbase-plugin.tar.gz, solr-couchbase-plugin.tar.gz
>
>
> It would be great if users could connect Couchbase and Solr so that updates to Couchbase
> can automatically flow to Solr. Couchbase provides some very nice APIs which allow applications
> to mimic the behavior of a Couchbase server so that they can receive updates via Couchbase's
> normal cross data center replication (XDCR).
> One possible design for this is to create a CouchbaseLoader that extends ContentStreamLoader.
> This new loader would embed the Couchbase APIs that listen for incoming updates from Couchbase,
> then marshal the Couchbase updates into the normal Solr update process.
> Instead of marshaling Couchbase updates into the normal Solr update process, we could
> also embed a SolrJ client to relay the request through the HTTP interfaces. This may be necessary
> if we have to handle mapping Couchbase "buckets" to Solr collections on the Solr side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

