lucene-solr-user mailing list archives

From Shawn Heisey <>
Subject Re: DIH to index the data - 250 millions - Need a best architecture
Date Mon, 29 Jul 2013 15:03:37 GMT
On 7/29/2013 6:00 AM, Santanu8939967892 wrote:
> Hi,
>    I have a huge volume of DB records, which is close to 250 millions.
> I am going to use DIH to index the data into Solr.
> I need a best architecture to index and query the data in an efficient
> manner.
> I am using windows server 2008 with 16 GB RAM, zion processor and Solr 4.4.

Gora and Jack have given you great information.  I would add that when
you are dealing with an index of this size, you need to be prepared to
spend some real money on hardware if you want maximum performance.

With 20-30 fields, I would imagine that each document is probably a few
KB in size.  Even if they turn out to be much smaller than that, with
250 million of them, your index will be pretty large.

I'd be VERY surprised if the index is less than 100GB, and something
larger than 500GB is probably more likely.  For illustration purposes,
let's be conservative and say it's 200GB.
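The back-of-envelope math behind those figures can be sketched like this; the per-document byte counts are illustrative assumptions, not measurements from any real index:

```python
# Rough index-size estimate: document count times average indexed
# bytes per document.  The per-doc sizes are assumed for illustration.
DOCS = 250_000_000

def index_size_gb(avg_bytes_per_doc):
    """Total on-disk index size in GB for DOCS documents."""
    return DOCS * avg_bytes_per_doc / 1024**3

for avg in (400, 800, 2048):  # assumed bytes per doc after indexing
    print(f"{avg} B/doc -> {index_size_gb(avg):.0f} GB")
```

Even the small end of that assumed range lands near 100GB, and a couple of KB per document pushes the total toward 500GB, which is why the estimates above span that range.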

16GB of RAM isn't enough for an index that size.  An ideal "round"
memory size for a 200GB index would be 256GB: 200GB of RAM for the OS
disk cache, plus enough left over for whatever size Java heap you might
need.  In truth, you probably don't need to cache the ENTIRE index --
most searches will involve only certain parts of the index and won't
touch the entire thing.  A "good enough" memory size might be 128GB,
which would keep the most relevant parts of the index in RAM at all times.

If you were to put a 200GB index on SSD, you could probably get away
with 64GB of RAM: 50GB or so for the OS disk cache and the rest for
the Java heap.
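The RAM arithmetic above can be sketched as a one-liner; the 8GB heap figure is an assumed placeholder, since the message deliberately leaves the heap size open:

```python
INDEX_GB = 200
HEAP_GB = 8  # assumed Java heap size -- pick whatever your install needs

def ram_needed(cache_fraction, index_gb=INDEX_GB, heap_gb=HEAP_GB):
    """RAM to hold cache_fraction of the index in the OS disk
    cache, plus the Java heap on top."""
    return index_gb * cache_fraction + heap_gb

ideal = ram_needed(1.0)   # whole index cached -> 208, buy a 256GB box
good  = ram_needed(0.5)   # hot half of the index -> 108, a 128GB box
ssd   = ram_needed(0.25)  # SSD makes cache misses cheap -> 58, a 64GB box
print(ideal, good, ssd)
```

The cache fractions are assumptions about how much of the index stays "hot"; the point is only that you round the result up to the next commodity memory size.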

If your index will be larger than 200GB, then the numbers I have given
you will go up.  These numbers also assume that you have your entire
index on one server, which is probably not a good idea.

SolrCloud would likely be the best architecture.  It would spread out
your system requirements and load across multiple machines.  If you had
20 machines, each with 16-32GB of RAM, you could do a SolrCloud
installation with 10 shards and a replicationFactor of 2, and there
wouldn't be any memory problems.  Each machine would have 25 million
records on it, and you'd have two complete copies of your index so you'd
be able to keep running if a machine completely failed -- which DOES happen.
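The layout described above works out as plain arithmetic (the numbers come straight from the message; nothing here is a Solr API call):

```python
DOCS = 250_000_000
MACHINES = 20
NUM_SHARDS = 10
REPLICATION_FACTOR = 2

docs_per_shard = DOCS // NUM_SHARDS            # 25 million docs per shard
total_cores = NUM_SHARDS * REPLICATION_FACTOR  # 20 shard replicas (cores)
cores_per_machine = total_cores // MACHINES    # exactly one core per machine

print(docs_per_shard, total_cores, cores_per_machine)
```

In a SolrCloud install of that era you would create such a collection through the Collections API (a CREATE action with numShards=10 and replicationFactor=2), and SolrCloud would spread the 20 replicas across the machines.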

The information I've given you is for an ideal setup.  You can go
smaller, and budget constraints might indeed force you to.  If you
don't need extremely good performance from Solr, then you don't need to
spend the money required for an architecture like the one I've described.
