lucene-solr-user mailing list archives

From MitchK <mitc...@web.de>
Subject Re: Solr and Nutch/Droids - to use or not to use?
Date Mon, 14 Jun 2010 13:30:47 GMT

Just wanted to push the topic a little bit, because these questions come up
quite often and the topic is very interesting to me.

Thank you!

- Mitch


MitchK wrote:
> 
> Hello community, and have a nice Saturday,
> 
> following several discussions about Solr and Nutch, I have some questions
> about a hypothetical web search engine.
> 
> The requirements:
> I. I need a scalable solution for a growing index that becomes larger than
> one machine can handle. If I add more hardware, I want performance to
> improve linearly.
> 
> II. I want to use techniques like the OPIC algorithm (the default in
> Nutch), PageRank, or whatever else is out there to improve the ranking
> of the web pages.
> 
> III. I want to be able to easily add more fields to my documents. Imagine
> one retrieves information from a web page's content; then I want to make
> it searchable.
> 
> IV. While fetching my data, I want to make specialized searches possible.
> For example, I want to retrieve pictures from a web page, index the
> picture-related content into a separate search index, and also save a
> small thumbnail of the picture itself. Btw: this is (as far as I know) not
> possible with Solr, because Solr was not intended to do such special
> indexing logic.
> 
> V. I want to use filter queries (e.g. the main query "christopher lee"
> returns 1.5 million results; for the subquery "action", the main query
> would become a filter query and "action" would be the actual query. That
> way, a search within search results would be easy to offer).
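
To make requirement V concrete, here is a minimal sketch of how such a request could be built against Solr's standard select handler, using its `q` and `fq` parameters. The base URL and the helper name `search_within_results` are assumptions made up for illustration; nothing here comes from the original message, and the request is only constructed, not sent.

```python
from urllib.parse import urlencode


def search_within_results(base_url, main_query, sub_query, rows=10):
    """Build a Solr select URL for a 'search within results' request.

    The earlier (main) query is passed as a filter query (fq), so it only
    restricts the result set; the refinement is passed as q and is the
    query that gets scored.
    """
    params = [
        ("q", sub_query),    # the actual, scored query
        ("fq", main_query),  # the previous query, now used as a filter
        ("rows", rows),
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical Solr endpoint; adjust host, port and path to your setup.
url = search_within_results(
    "http://localhost:8983/solr/select", '"christopher lee"', "action"
)
print(url)
```

As far as I know, filter queries do not influence the score and are cached separately by Solr, which is what makes this pattern cheap when a user refines the same main query several times.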
> 
> VI. I want to be able to use different logic for different pages. Maybe I
> have a pool of 100 domains that I know better than others, and I have
> special scripts that retrieve more specific information from those 100
> domains. Then I want to apply my special logic to those 100 domains, while
> every other domain uses the default logic.
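
Requirement VI boils down to a per-domain dispatch with a fallback. The following is only a sketch under stated assumptions: the extractor functions, the `SPECIAL_EXTRACTORS` table and the domain `movies.example.com` are all hypothetical and are not part of Nutch, Droids or Solr.

```python
from urllib.parse import urlparse


def default_extractor(url, html):
    """Default logic: keep only the URL and the raw body."""
    return {"url": url, "body": html}


def movie_site_extractor(url, html):
    """Special logic for a well-known domain (hypothetical example)."""
    doc = default_extractor(url, html)
    doc["site_type"] = "movies"  # extra field only this domain provides
    return doc


# Hypothetical pool of domains that get special treatment.
SPECIAL_EXTRACTORS = {
    "movies.example.com": movie_site_extractor,
}


def extract(url, html):
    """Pick the extractor by host name, falling back to the default."""
    host = urlparse(url).hostname or ""
    extractor = SPECIAL_EXTRACTORS.get(host, default_extractor)
    return extractor(url, html)


special = extract("http://movies.example.com/film/1", "<p>...</p>")
other = extract("http://other.example.org/", "<p>...</p>")
```

Here `special` carries the additional `site_type` field while `other` contains only the default fields. In a real crawler this lookup would presumably live in a parsing or indexing plugin.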
> 
> -----------------
> 
> The project is only hypothetical. So why am I asking?
> I want to learn more about web search and I would like to gain some new
> experience.
> 
> What do I know about Solr + Nutch:
> According to an article on lucidimagination.com, Solr + Nutch does not
> scale if the index is too large.
> The article was somewhat old, and I don't know whether this problem has
> been fixed by the new distributed capabilities of Solr.
> 
> Furthermore, I don't want to index the pages with Nutch and reindex them
> with Solr.
> The only exception would be: if the content of a web page gets indexed by
> Nutch, I want to use the already tokenized content of the body with some
> Solr copyField operations to extend the search (e.g. making fuzzy search
> possible). At the moment, I don't think this is possible.
> 
> I don't know much about the Droids project or how well it is documented.
> But from what I have read in some of Otis's posts, it seems to be usable
> as a crawler framework.
> 
> 
> Pros for Nutch: It is very scalable! Thanks to Hadoop and MapReduce it
> is a scaling monster (from what I've read).
> 
> Cons: The search is not as rich as what is possible with Solr. Extending
> Nutch's search capabilities *seems* to be more complicated than extending
> Solr's. Furthermore, if I want to use Solr to search Nutch's index, then
> given my requirements I would need to reindex the whole thing - without
> the benefits of Hadoop.
> 
> What I don't know at the moment is how algorithms like those mentioned
> in II. can be used with Solr.
> 
> I hope you understand the problem here - it *seems* to me that Solr would
> not be the best solution for a web search engine, because of how its
> indexing scales.
> 
> 
> Where should I dive deeper? 
> Solr + Droids?
> Solr + Nutch?
> Nutch + howToExtendNutchToMakeSearchBetter?
> 
> 
> Thanks for the discussion!
> - Mitch
> 
-- 
View this message in context: http://lucene.472066.n3.nabble.com/Solr-and-Nutch-Droids-to-use-or-not-to-use-tp890640p894391.html
Sent from the Solr - User mailing list archive at Nabble.com.
