nutch-dev mailing list archives

From Ken Krugler
Subject Re: Nutch dev. plans
Date Fri, 17 Jul 2009 20:03:29 GMT

>> Dogacan, you mentioned that you would like to work on Katta integration.
>> Could you shed some light on how this fits with the abstract indexing &
>> searching layer that we now have, and how distributed Solr fits into this
>> picture?
>I haven't yet given much thought to Katta integration. But basically,
>I am thinking of indexing newly-crawled documents as Lucene shards and
>uploading them to Katta for searching. This should be very possible
>with the new indexing system. But so far, I have neither studied Katta
>too much nor given much thought to integration. So I may be missing
>obvious stuff.

I've got some experience in this area, so let me know what questions, 
if any, you've got.

But the basic approach is very simple - just create N indexes (one
per reducer), move them to HDFS, S3, or some other location where the
Katta master & slaves can all access the shards, and then use the
Katta "addIndex" command or supporting Java code to deploy the index.
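
To make the "one index per reducer" layout concrete, here is a rough
sketch, not actual Nutch or Katta code. It assumes documents are
partitioned by URL using the same arithmetic as Hadoop's default
HashPartitioner, and that each reducer writes its Lucene index to a
part-NNNNN subdirectory of a common index root. The class name, the
method names, and the HDFS path are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ShardLayout {
    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int shardFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Assumed naming: reducer r writes its shard to <indexRoot>/part-0000r.
    static String shardDir(String indexRoot, int reducer) {
        return String.format(Locale.ROOT, "%s/part-%05d", indexRoot, reducer);
    }

    // All N shard directories that together make up one deployable index.
    static List<String> shardDirs(String indexRoot, int numReducers) {
        List<String> dirs = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            dirs.add(shardDir(indexRoot, r));
        }
        return dirs;
    }

    public static void main(String[] args) {
        // Hypothetical shared location that the Katta master and slaves can read.
        String indexRoot = "hdfs://namenode/indexes/crawl-20090717";
        int numReducers = 4;

        for (String dir : shardDirs(indexRoot, numReducers)) {
            System.out.println(dir);
        }

        String url = "http://example.com/";
        System.out.println(url + " -> shard " + shardFor(url, numReducers));
    }
}
```

With a layout like this, the index root is the path you would hand to
"addIndex", and each part-NNNNN directory under it is one shard.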

>About distributed Solr: I would very much like to do this and again, I
>think, this should be possible to do within Nutch. However,
>distributed Solr is ultimately uninteresting to me because (AFAIK) it
>doesn't have the reliability and high-availability that Hadoop & HBase
>have, i.e. if a machine dies you lose that part of the index.
>Are there any projects going on that are live indexing systems like
>Solr, yet are backed up by Hadoop HDFS like Katta?

Note that Katta doesn't use HDFS as a backing store - the shards are 
copied to the local disks of the slaves for performance reasons.

There has been work on making Katta work better for near-real-time
updating, versus the current, very batch-oriented approach. See the
Katta list for more details.

-- Ken
Ken Krugler
+1 530-210-6378
