lucene-solr-user mailing list archives

From Jonathan Rochkind <rochk...@jhu.edu>
Subject Re: Support for huge data set?
Date Thu, 12 May 2011 19:44:14 GMT
If each document is VERY small, it's actually possible that one Solr 
server could handle it -- especially if you DON'T try to do faceting or 
other similar features, but stick to straight search and relevancy. 
There are other factors too. But # of documents is probably less 
important than total size of index, or number of unique terms -- of 
course # of documents often correlates to those too.

But if each document is largeish... yeah, I suspect that'll be too much 
for any one Solr server. You'll have to use some kind of distribution. 
Out of the box, Solr has a Distributed Search function meant for this 
use case: http://wiki.apache.org/solr/DistributedSearch. Some Solr 
features don't work under a Distributed setup, but the basic ones are 
there. There are also other add-ons, not (yet anyway) part of the Solr 
distro, that try to solve this in even more sophisticated ways, like SolrCloud.
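For a feel of what Distributed Search looks like in practice, here's a minimal sketch. The hostnames and the query are made up for illustration; the real mechanics are just the standard shards request parameter, where you list the shard servers and send an ordinary query to any one of them, which fans it out and merges the results:

```shell
# Hypothetical shard servers -- substitute your own hosts/cores.
SHARD1="solr1.example.com:8983/solr"
SHARD2="solr2.example.com:8983/solr"

# Any shard can act as the coordinator; it queries every shard
# listed in the shards parameter and merges the ranked results.
URL="http://${SHARD1}/select?q=title:lucene&shards=${SHARD1},${SHARD2}"
echo "$URL"

# In real use you'd issue the request, e.g.:
#   curl "$URL"
```

Note that documents have to be partitioned across the shards at index time; Distributed Search only handles the query side, not distributed indexing.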

I don't personally know of anyone indexing that many documents, although 
it is probably done. But I do know of the HathiTrust project (not me 
personally) indexing fewer documents but still adding up to terabytes 
of total index (millions to tens of millions of documents, but each one 
is a digitized book that could be 100-400 pages), using the Distributed 
Search feature successfully, although it required some care and 
maintenance; it wasn't just a "turn it on and it works" situation.

http://www.hathitrust.org/blogs/large-scale-search/scaling-large-scale-search-500000-volumes-5-million-volumes-and-beyond

http://www.hathitrust.org/technical_reports/Large-Scale-Search.pdf

On 5/12/2011 1:06 PM, Darren Govoni wrote:
> I have the same questions.
>
> But from your message, I couldn't tell. Are you using Solr now? Or some
> other indexing server?
>
> Darren
>
> On Thu, 2011-05-12 at 09:59 -0700, atreyu wrote:
>> Hi,
>>
>> I have about 300 million docs (or 10TB data) which is doubling every 3
>> years, give or take.  The data mostly consists of Oracle records, webpage
>> files (HTML/XML, etc.) and office doc files.  There are between two and four
>> dozen concurrent users, typically.  The indexing server has > 27 GB of RAM,
>> but it still gets extremely taxed, and this will only get worse.
>>
>> Would Solr be able to efficiently deal with a load of this size?  I am
>> trying to avoid the heavy cost of GSA, etc...
>>
>> Thanks.
>>
>>
>> --
>> View this message in context: http://lucene.472066.n3.nabble.com/Support-for-huge-data-set-tp2932652p2932652.html
>> Sent from the Solr - User mailing list archive at Nabble.com.
>
>
