lucene-solr-user mailing list archives

From Erick Erickson <>
Subject Re: Large RDBMS dataset
Date Wed, 14 Dec 2011 18:48:11 GMT
You can also consider using SolrJ to do this. I posted a small example a couple
of days ago.
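
For readers without that earlier post at hand, here is a minimal sketch of the batching step that a client-side indexer usually needs. It is not the example referred to above and it does not use the SolrJ API: the `sink` parameter is a hypothetical stand-in for whatever call sends a list of documents to Solr, and the batch size of 1000 is only an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: group rows read from the database into fixed-size
// batches before handing them to an indexing client.
class BatchingSketch {

    // Sends rows to the sink in groups of at most batchSize.
    // Returns the number of batches that were sent.
    static <T> int sendInBatches(Iterable<T> rows, int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<T> batch = new ArrayList<>(batchSize);
        int batches = 0;
        for (T row : rows) {
            batch.add(row);
            if (batch.size() == batchSize) {
                sink.accept(new ArrayList<>(batch)); // hand over a copy, then reuse the buffer
                batch.clear();
                batches++;
            }
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch));     // flush the final partial batch
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 2500; i++) {
            rows.add(i);
        }
        // The sink here only prints; a real indexer would send the batch to Solr.
        int sent = sendInBatches(rows, 1000, b -> System.out.println("batch of " + b.size()));
        System.out.println(sent + " batches sent");
    }
}
```

Sending documents in batches rather than one at a time keeps the number of round trips to the server low, which is usually where a 1 rec/s pipeline loses its time.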


On Wed, Dec 14, 2011 at 10:39 AM, Gora Mohanty <> wrote:
> On Wed, Dec 14, 2011 at 3:48 PM, Finotti Simone <> wrote:
>> Hello,
>> I have a very large dataset (> 1M records) in the RDBMS, which I want my Solr application
>> to pull data from.
> [...]
>> It works, but it takes 1'38" to parse 100 records: that is about 1 rec/s! That means that
>> digesting the whole dataset would take about 1 Ms (=> 12 days).
> Depending on the size of the data that you are pulling from
> the database, 1M records is not really that large a number.
> We were doing ~75GB of stored data from ~7million records
> in about 9h, including quite complicated transformers. I would
> imagine that there is much room for improvement in your case
> also. Some notes on this:
> * If you have servers to throw at the problem, and a sensible
>  way to shard your RDBMS data, use parallel indexing to
>  multiple Solr cores, maybe on multiple servers, followed by
>  a merge. In our experience, given enough RAM and adequate
>  provisioning of database servers, indexing speed scales linearly
>  with the total no. of cores.
> * Replicate your database, manually if needed. Look at the load
>  on a database server during the indexing process, and provision
>  enough database servers to match the no. of Solr indexing servers.
> * This point is leading into flamewar territory, but consider switching
>  databases. From our (admittedly non-rigorous) measurements,
>  MySQL was at least a factor of 2-3 faster than MS-SQL, with the
>  same dataset.
> * Look at cloud-computing. If finances permit, one should be able
>  to shrink indexing times to almost any desired level. E.g., for the
>  dataset that we used, I have little doubt that we could have shrunk
>  the time down to less than 1h, at an affordable cost on Amazon EC2.
>  Unfortunately, we have not yet had the opportunity to try this.
>> The problem is that for each record in "fd", Solr makes three distinct SELECTs on
>> the other three tables. Of course, this is absolutely inefficient.
>> Is there a way to have Solr load every record in the four tables and join them
>> once they are already loaded in memory?
> For various reasons, we did not investigate this in depth,
> but you could also look at Solr's CachedSqlEntityProcessor.
> Regards,
> Gora
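
The in-memory join asked about in the quoted question can be illustrated with a small sketch. This is not the implementation of CachedSqlEntityProcessor, only the general idea of reading a child table once, indexing its rows by foreign key, and then looking them up per parent row instead of issuing one SELECT per record. Apart from "fd", every name here (the class, the methods, the row layout of {foreignKey, value}) is a hypothetical assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: join parent rows (e.g. from "fd") with child rows in
// memory, instead of running one SELECT per parent row.
class InMemoryJoinSketch {

    // Read the child table once and group its values by foreign key.
    // Each child row is assumed to be {foreignKey, value}.
    static Map<Integer, List<String>> indexByKey(List<String[]> childRows) {
        Map<Integer, List<String>> byKey = new HashMap<>();
        for (String[] row : childRows) {
            int key = Integer.parseInt(row[0]);
            byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(row[1]);
        }
        return byKey;
    }

    // Build one entry per parent id using map lookups only (no further queries).
    static Map<Integer, List<String>> join(List<Integer> parentIds,
                                           Map<Integer, List<String>> childIndex) {
        Map<Integer, List<String>> docs = new LinkedHashMap<>();
        for (int id : parentIds) {
            List<String> children = childIndex.get(id);
            docs.put(id, children != null ? children : Collections.<String>emptyList());
        }
        return docs;
    }

    public static void main(String[] args) {
        // Stand-in data; a real indexer would fill these lists from two full-table SELECTs.
        List<Integer> fdIds = Arrays.asList(1, 2, 3);
        List<String[]> childRows = Arrays.asList(
                new String[] {"1", "red"},
                new String[] {"1", "blue"},
                new String[] {"3", "green"});

        Map<Integer, List<String>> docs = join(fdIds, indexByKey(childRows));
        for (Map.Entry<Integer, List<String>> e : docs.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The trade-off is memory: the whole child table has to fit in the heap, which is the same consideration that applies to any cache-based join.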
