lucene-solr-user mailing list archives

From MitchK <>
Subject Solr and Nutch/Droids - to use or not to use?
Date Sat, 12 Jun 2010 11:41:53 GMT

Hello community, and have a nice Saturday,

After several discussions about Solr and Nutch, I have some questions about a
hypothetical web search engine.

The requirements:
I. I need a scalable solution for a growing index that becomes larger than
one machine can handle. If I add more hardware, I want to linearly improve the

II. I want to use technologies like the OPIC algorithm (the default algorithm
in Nutch) or PageRank or whatever else is out there to improve the ranking of the

III. I want to be able to easily add more fields to my documents. Imagine
one retrieves information from a webpage's content, then I want to make it

IV. While fetching my data, I want to make special searches possible. For
example, I want to retrieve pictures from a webpage and index
picture-related content into another search index, and I also want to save a
small thumbnail of the picture itself. Btw: This is (as far as I know) not
possible with Solr, because Solr was not intended to do such special

V. I want to use filter queries (e.g. the main query "christopher lee" returns
1.5 million results; for the subquery "action", the main query would become a
filter query and "action" would be the actual query, so a search within search
results would be easy to offer).
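
To illustrate requirement V, here is a minimal sketch of how such a request
could be built. `q` and `fq` are Solr's standard query and filter-query
parameters; the endpoint URL and the function name are assumptions made up
for this example, not part of any existing setup.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; host, port and path are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def search_within_results(main_query, sub_query):
    """Build a request URL that searches for sub_query inside the
    result set of main_query (hypothetical helper)."""
    params = [
        ("q", sub_query),    # the actual query, which gets scored
        ("fq", main_query),  # the earlier query, applied only as a filter
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = search_within_results('"christopher lee"', "action")
print(url)
```

Since a filter query only restricts the result set and does not contribute to
the score, the ranking here would come from "action" alone.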

VI. I want to be able to use different logic for different pages. Maybe I
have a pool of 100 domains that I know better than others, and I have special
scripts that retrieve more specific information from those 100 domains. Then I
want to apply my special logic to those 100 domains, but every other domain
should use the default logic.
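
Requirement VI could be sketched roughly as a lookup table from domain to
extraction function, with a fallback to the default logic. This is only an
illustration: all names below (`SPECIAL_DOMAINS`, `special_extractor`, the
`special` field, `example.org`) are hypothetical and not tied to Nutch, Droids
or Solr.

```python
from urllib.parse import urlparse


def default_extractor(url, html):
    """Default logic: keep only the URL and the raw page content."""
    return {"url": url, "content": html}


def special_extractor(url, html):
    """Placeholder for a hand-written script for a well-known domain."""
    fields = default_extractor(url, html)
    fields["special"] = True  # stand-in for the extra, domain-specific fields
    return fields


# Hypothetical pool of well-known domains, mapped to their special logic.
SPECIAL_DOMAINS = {"example.org": special_extractor}


def extract(url, html):
    """Pick the extractor by host name, falling back to the default logic."""
    host = urlparse(url).hostname or ""
    extractor = SPECIAL_DOMAINS.get(host, default_extractor)
    return extractor(url, html)
```

For example, `extract("http://example.org/a", "...")` would go through the
special script, while any unlisted domain would get the default fields only.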


The project is purely hypothetical. So why am I asking?
I want to learn more about web search and I would like to make some new

What do I know about Solr + Nutch:
As it is said on, Solr + Nutch does not scale if the
index is too large.
The article was somewhat old, and I don't know whether this problem
has been fixed by the new distributed capabilities of Solr.

Furthermore, I don't want to index the pages with Nutch and reindex them with
The only exception would be: If the content of a webpage gets indexed by
Nutch, I want to use the already tokenized content of the body with some
Solr copyField operations to extend the search (e.g. making fuzzy search
possible). At the moment, I don't think this is possible.

I don't know much about the Droids project or how well it is documented.
But from what I have read in some of Otis's posts, it seems to be usable as a

Pros for Nutch: It is very scalable! Thanks to Hadoop and MapReduce, it
is a scaling monster (from what I've read).

Cons: The search is not as rich as it is with Solr. Extending Nutch's
search abilities *seems* to be more complicated than with Solr. Furthermore,
if I want to use Solr to search Nutch's index, looking at my requirements I
would need to reindex the whole thing - without the benefits of Hadoop.

What I don't know at the moment is how to use algorithms like those
mentioned in II. with Solr.

I hope you understand the problem here - to me, Solr *seems* not to
be the best solution for a web search engine, because of scaling reasons in

Where should I dive deeper? 
Solr + Droids?
Solr + Nutch?
Nutch + howToExtendNutchToMakeSearchBetter?

Thanks for the discussion!
- Mitch