lucene-solr-user mailing list archives

From Shalin Shekhar Mangar <shalinman...@gmail.com>
Subject Re: Indexing and searching of sharded/ partitioned databases and tables
Date Wed, 07 Oct 2009 09:51:56 GMT
Comments inline:

On Wed, Oct 7, 2009 at 2:01 PM, Jayant Kumar Gandhi <jayantg@gmail.com> wrote:

>
> Let's say I have 3 MySQL databases, each with 3 tables.
>
> Db1 : Tbl1, Tbl2, Tbl3
> Db2 : Tbl1, Tbl2, Tbl3
> Db3 : Tbl1, Tbl2, Tbl3
>
> All databases have the same number of tables with same table names as
> shown above. All tables have exactly the same structure as well. Each
> table has three fields:
> id, name, category
>
> Since the data is distributed this way, I don't have a way to search
> for a particular record using 'name'. I must look for it in all the 9
> tables. This is not scalable when, let's say, I have 20 databases each
> with 20 tables, meaning 400 queries needed to find a single record.
>
> Solr seemed like the solution to help.
>
> I followed the wiki tutorials:
> http://wiki.apache.org/solr/DataImportHandler
> http://wiki.apache.org/solr/DIHQuickStart
> http://wiki.apache.org/solr/DataImportHandlerFaq
>
> The following are my config files so far:
> ================
> solrconfig.xml
> ================
> <requestHandler name="/dataimport"
> class="org.apache.solr.handler.dataimport.DataImportHandler">
> <lst name="defaults">
>  <str name="config">data-config.xml</str>
> </lst>
> </requestHandler>
>
> ================
> data-config.xml (so far)
> ================
> <dataConfig>
>  <dataSource type="JdbcDataSource" name="ds1"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db1"
> user="user-name" password="password" />
>  <dataSource type="JdbcDataSource" name="ds2"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db2"
> user="user-name" password="password" />
>  <dataSource type="JdbcDataSource" name="ds3"
> driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/Db3"
> user="user-name" password="password" />
>  <document>
>    <entity name="record11" dataSource="ds1" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record12" dataSource="ds1" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record13" dataSource="ds1" query="select
> id,name,category from Tbl3"></entity>
>    <entity name="record21" dataSource="ds2" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record22" dataSource="ds2" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record23" dataSource="ds2" query="select
> id,name,category from Tbl3"></entity>
>    <entity name="record31" dataSource="ds3" query="select
> id,name,category from Tbl1"></entity>
>    <entity name="record32" dataSource="ds3" query="select
> id,name,category from Tbl2"></entity>
>    <entity name="record33" dataSource="ds3" query="select
> id,name,category from Tbl3"></entity>
>  </document>
> </dataConfig>
>
> ================
> Doubts/ Questions:
> ================
>
> - Is this the right way to index this data?
> - Is there a better way to achieve this? Imagine 20 databases with 20
> tables each translates to 400 lines in the XML. This doesn't scale for
> something like 200 databases and 200 tables each. Will Solr continue
> to work/ index properly if I had 40000 entity rows without going out
> of memory?
>

Seems OK. Your original database is sharded so I'm guessing the amount of
data is quite large. The number of root entities does not matter. What
matters is the total number of documents. As you go from indexing 20
database shards to 200 shards, you will likely cross a point where indexing
all of them on a single Solr box is either impossible (due to the large
number of documents) or very slow. Similarly, response times may also
suffer.
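
Since every entity in your config follows the same pattern, the file can be
generated instead of written by hand. Below is a minimal sketch in Python,
assuming the databases are named Db1..DbN and the tables Tbl1..TblM as in your
example. The function name build_data_config and the entity naming scheme
record_<db>_<table> are my own choices for illustration, not anything DIH
requires.

```python
import xml.etree.ElementTree as ET


def build_data_config(num_dbs, num_tables, host="localhost",
                      user="user-name", password="password"):
    """Return a data-config.xml string covering num_dbs x num_tables tables.

    Assumes databases Db1..DbN and tables Tbl1..TblM, as in the question.
    """
    root = ET.Element("dataConfig")

    # One JdbcDataSource per database, named ds1..dsN.
    for d in range(1, num_dbs + 1):
        ET.SubElement(root, "dataSource", {
            "type": "JdbcDataSource",
            "name": f"ds{d}",
            "driver": "com.mysql.jdbc.Driver",
            "url": f"jdbc:mysql://{host}/Db{d}",
            "user": user,
            "password": password,
        })

    document = ET.SubElement(root, "document")

    # One root entity per (database, table) pair. The underscore keeps names
    # unambiguous once there are more than 9 databases or tables.
    for d in range(1, num_dbs + 1):
        for t in range(1, num_tables + 1):
            ET.SubElement(document, "entity", {
                "name": f"record_{d}_{t}",
                "dataSource": f"ds{d}",
                "query": f"select id,name,category from Tbl{t}",
            })

    return ET.tostring(root, encoding="unicode")
```

Writing the result of build_data_config(20, 20) to data-config.xml would give
the 400 entities you mention. One thing worth checking, which I am assuming
rather than reading from your schema: if id is the uniqueKey in schema.xml and
id values repeat across tables, documents with the same id will overwrite each
other, so the query may need to produce an id that is unique across all tables.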

Solr supports distributed search: you split your Solr index into shards, each
holding a disjoint set of documents. You continue to query Solr normally,
apart from passing an additional shards request parameter, and Solr queries
every shard, merges the results and returns them as if you were querying a
single Solr instance.

See http://wiki.apache.org/solr/DistributedSearch for more details.
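
To make the shards parameter concrete, here is a small Python sketch that adds
it to an ordinary query. The host names solr1 and solr2, the helper name
add_shards and the sample query are placeholders I made up; the only part
taken from Solr is that shards is a comma-separated list of host:port/path
entries without the http:// prefix.

```python
from urllib.parse import urlencode


def add_shards(params, shard_addresses):
    """Return a copy of params with the shards request parameter added."""
    merged = dict(params)
    # Comma-separated host:port/path entries, without the URL scheme.
    merged["shards"] = ",".join(shard_addresses)
    return merged


# Hypothetical Solr instances, each holding a disjoint part of the index.
shards = ["solr1:8983/solr", "solr2:8983/solr"]
params = add_shards({"q": "name:jayant"}, shards)

# The request goes to any one instance, which fans it out to the shards.
url = "http://solr1:8983/solr/select?" + urlencode(params)
```

The rest of the query stays exactly as it would be against a single instance.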


> - I would really like to be able to search through the complete database
> for a 'name' and do things like 'category' filtering easily,
> independent of the entity name/ datasource. For me they are all
> records of the same type.
>
>
That is very much possible out of the box.
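
As a rough illustration, a name search with a category filter could be built
like this in Python. This assumes name and category are indexed fields in your
schema.xml and that Solr is reachable at the base URL you pass in; the function
names are hypothetical. Nothing in the query refers to an entity or a
datasource, because all entities feed the same index.

```python
from urllib.parse import urlencode


def escape_phrase(value):
    """Escape backslashes and double quotes for use inside a quoted phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def name_search_url(base_url, name, category=None):
    """Build a /select URL that searches name and optionally filters category."""
    params = [("q", f'name:"{escape_phrase(name)}"')]
    if category is not None:
        # fq restricts the result set without affecting relevance scoring.
        params.append(("fq", f'category:"{escape_phrase(category)}"'))
    return f"{base_url}/select?{urlencode(params)}"


url = name_search_url("http://localhost:8983/solr", "foo bar", "books")
```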

-- 
Regards,
Shalin Shekhar Mangar.
