lucene-solr-user mailing list archives

From Ali S Kureishy <safdar.kurei...@gmail.com>
Subject Options for automagically Scaling Solr (without needing distributed index/replication) in a Hadoop environment
Date Thu, 12 Apr 2012 13:04:10 GMT
Hi,

I'm trying to set up a large-scale *Crawl + Index + Search* infrastructure
using Nutch and Solr/Lucene. The targeted scale is *5 billion web pages*,
crawled + indexed every *4 weeks*, with a search latency of less than 0.5
seconds.
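
To put that target in perspective, here is a rough back-of-envelope sketch
of the sustained crawl + index rate it implies. It assumes the 4-week cycle
means 28 days of continuous operation with no downtime, and the function
name is just for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def required_crawl_rate(total_pages: int, cycle_days: int) -> float:
    """Average pages per second needed to cover total_pages in cycle_days."""
    return total_pages / (cycle_days * SECONDS_PER_DAY)


if __name__ == "__main__":
    # Assumption: 5 billion pages, refreshed once every 28 days.
    rate = required_crawl_rate(5_000_000_000, 28)
    print(f"{rate:.0f} pages/second sustained")
```

That works out to roughly 2,067 pages per second, averaged over the whole
cycle, before allowing for retries or failed fetches.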

Needless to say, the search index needs to scale to 5 billion pages. It
is also possible that I might need to store multiple indexes -- one for
crawled content, and one for ancillary data that is also very large. Each
of these indices would likely require a logically distributed and
replicated index.

However, I would like for such a system to be homogeneous with the Hadoop
infrastructure that is already installed on the cluster (for the crawl). In
other words, I would much prefer if the replication and distribution of the
Solr/Lucene index be done automagically on top of Hadoop/HDFS, instead of
using another scalability framework (such as SolrCloud). In addition, it
would be ideal if this environment was flexible enough to be dynamically
scaled based on the size requirements of the index and the search traffic
at the time (i.e. if it is deployed on an Amazon cluster, it should be easy
enough to automatically provision additional processing power into the
cluster without requiring server re-starts).

However, I'm not sure which Solr-based tool in the Hadoop ecosystem would
be ideal for this scenario. I've heard mention of Solr-on-HBase, Solandra,
Lily, ElasticSearch, IndexTank, etc., but I'm really unsure which of these is
mature enough and would be the right architectural choice to go along with
a Nutch crawler setup, and to also satisfy the dynamic/auto-scaling aspects
above.

Lastly, how much hardware (assuming medium-sized EC2 instances) would you
estimate I would need with this setup, for regular web data (HTML text) at
this scale?

Any architectural guidance would be greatly appreciated. The more details
provided, the wider my grin :).

Many many thanks in advance.

Thanks,
Safdar
