lucene-solr-user mailing list archives

From Jan Høydahl <>
Subject Re: Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Date Sat, 14 Apr 2012 11:17:34 GMT

This won't give you the performance you need, unless you have enough RAM on the Solr box to
cache the whole index in memory.
Have you tested this yourself?
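
As a rough sanity check on the RAM point, here is a back-of-the-envelope sketch in Python. The page count and the 4-week recrawl window come from the original question; the index bytes per document and the cache RAM per node are placeholder assumptions, not measurements, so replace them with numbers taken from a sample index before drawing any conclusions.

```python
import math

def required_crawl_rate(pages: int, days: int) -> float:
    """Pages per second needed to (re)crawl `pages` once every `days` days."""
    return pages / (days * 24 * 60 * 60)

def nodes_to_cache_index(docs: int, index_bytes_per_doc: int,
                         cache_bytes_per_node: int) -> int:
    """Number of nodes needed so the whole index fits in RAM across the cluster."""
    total_index_bytes = docs * index_bytes_per_doc
    return math.ceil(total_index_bytes / cache_bytes_per_node)

# From the original question.
PAGES = 5_000_000_000
RECRAWL_DAYS = 28

# ASSUMPTIONS (hypothetical placeholders, not measured values).
INDEX_BYTES_PER_DOC = 2 * 1024        # on-disk index size per page
CACHE_BYTES_PER_NODE = 16 * 1024**3   # RAM per node left over for caching the index

rate = required_crawl_rate(PAGES, RECRAWL_DAYS)
nodes = nodes_to_cache_index(PAGES, INDEX_BYTES_PER_DOC, CACHE_BYTES_PER_NODE)

print(f"crawl rate needed: {rate:.0f} pages/sec")
print(f"nodes needed to hold the index in RAM: {nodes}")
```

The node count scales linearly with the assumed index size per document, so that is the number worth measuring first. Replication for query throughput would multiply it further.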

Jan Høydahl, search solution architect
Cominvent AS -
Solr Training -

On 12. apr. 2012, at 15:27, Darren Govoni wrote:

> You could use SolrCloud (for the automatic scaling) and just mount a
> FUSE[1] HDFS directory and configure Solr to use that directory for its
> data.
> [1]
> On Thu, 2012-04-12 at 16:04 +0300, Ali S Kureishy wrote:
>> Hi,
>> I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
>> using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
>> crawled + indexed every *4 weeks*, with a search latency of less than 0.5
>> seconds.
>> Needless to mention, the search index needs to scale to 5 billion pages. It
>> is also possible that I might need to store multiple indexes -- one for
>> crawled content, and one for ancillary data that is also very large. Each
>> of these indices would likely require a logically distributed and
>> replicated index.
>> However, I would like for such a system to be homogeneous with the Hadoop
>> infrastructure that is already installed on the cluster (for the crawl). In
>> other words, I would much prefer if the replication and distribution of the
>> Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
>> using another scalability framework (such as SolrCloud). In addition, it
>> would be ideal if this environment was flexible enough to be dynamically
>> scaled based on the size requirements of the index and the search traffic
>> at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
>> enough to automatically provision additional processing power into the
>> cluster without requiring server re-starts).
>> However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
>> be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
>> Lily, ElasticSearch, IndexTank etc, but I'm really unsure which of these is
>> mature enough and would be the right architectural choice to go along with
>> a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
>> above.
>> Lastly, how much hardware (assuming medium-sized EC2 instances) would you
>> estimate I would need with this setup, for regular web data (HTML text) at
>> this scale?
>> Any architectural guidance would be greatly appreciated. The more details
>> provided, the wider my grin :).
>> Many many thanks in advance.
>> Thanks,
>> Safdar
