nutch-dev mailing list archives

From Andrzej Bialecki <ab@getopt.org>
Subject Re: Nutch dev. plans
Date Fri, 17 Jul 2009 18:32:18 GMT
Doğacan Güney wrote:
> Hey list,
> 
> On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki <ab@getopt.org> wrote:
>> Hi all,
>>
>> I think we should create a sandbox area where we can collaborate on
>> various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan will
>> be importing his HBase work as 'nutchbase'. The Tika work is the least
>> disruptive, so it could happen even on trunk. The OSGI plugin work (which
>> I'd like to tackle) means significant refactoring, so I'd rather put that
>> on a branch too.
>>
> 
> Thanks for starting the discussion, Andrzej.
> 
> Can you detail your OSGI plugin framework design? Maybe I missed the
> discussion, but updating the plugin system is something I have wanted to
> do for a long time :) so I am very much interested in your design.

There's no specific design yet, except that I can't stand the existing 
plugin framework anymore ... ;) I've started reading up on OSGI, and it 
seems to support the functionality we need, and much more - it certainly 
looks like a better alternative to maintaining our own plugin system 
beyond 1.x ...
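
Purely as an illustration of the general pattern (not any agreed design - 
the Parser interface and class names below are made up for the sketch), an 
extension published as an OSGI service could look roughly like this:

  // Purely illustrative: publishing a parser as an OSGI service instead of
  // a plugin.xml extension point. "Parser" is a placeholder interface made
  // up for this sketch, not an agreed Nutch design.
  import org.osgi.framework.BundleActivator;
  import org.osgi.framework.BundleContext;
  import org.osgi.framework.ServiceRegistration;

  public class ParserBundle implements BundleActivator {

    /** Placeholder extension interface, standing in for a real Nutch one. */
    public interface Parser {
      String parse(String content);
    }

    private ServiceRegistration registration;

    public void start(BundleContext context) {
      Parser htmlParser = new Parser() {
        public String parse(String content) {
          return content; // trivial body, just to keep the sketch complete
        }
      };
      // One registerService() call replaces the plugin.xml wiring; consumers
      // look the service up through the BundleContext, and the framework
      // handles discovery, versioning and lifecycle.
      registration = context.registerService(Parser.class.getName(),
          htmlParser, null);
    }

    public void stop(BundleContext context) {
      registration.unregister();
    }
  }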

Oh, an additional comment about the scoring API: I don't think the claimed 
benefits of OPIC outweigh the complications it has caused throughout the 
API. Besides, getting static scoring right is very tricky, so from an 
engineering point of view IMHO it's better to do the computation offline, 
where you have more control over the process and can easily re-run it, 
rather than rely on an unstable online algorithm that modifies scores in 
place ...
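
To make the contrast concrete, here's a toy in-memory example of the 
offline approach - a PageRank-style sweep over a fixed link graph that you 
can simply re-run with different parameters. It's not OPIC and not Nutch 
code, just an illustration of why a batch recomputation is easy to control:

  // Toy illustration only (not OPIC, not Nutch code): iterate a
  // PageRank-style computation over a fixed link graph, and simply
  // re-run the whole thing if the parameters change.
  import java.util.*;

  public class OfflineScoreDemo {
    public static void main(String[] args) {
      // url -> outlinks: a tiny hard-coded web graph
      Map<String, List<String>> links = new HashMap<String, List<String>>();
      links.put("a", Arrays.asList("b", "c"));
      links.put("b", Arrays.asList("c"));
      links.put("c", Arrays.asList("a"));

      double damping = 0.85;
      Map<String, Double> score = new HashMap<String, Double>();
      for (String url : links.keySet()) {
        score.put(url, 1.0 / links.size());
      }

      for (int i = 0; i < 20; i++) {       // a fixed number of full sweeps
        Map<String, Double> next = new HashMap<String, Double>();
        for (String url : links.keySet()) {
          next.put(url, (1 - damping) / links.size());
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
          double share = score.get(e.getKey()) / e.getValue().size();
          for (String out : e.getValue()) {
            next.put(out, next.get(out) + damping * share);
          }
        }
        score = next;                      // scores are never updated in place
      }
      System.out.println(score);
    }
  }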


> 
>> Dogacan, you mentioned that you would like to work on Katta integration.
>> Could you shed some light on how this fits with the abstract indexing &
>> searching layer that we now have, and how distributed Solr fits into this
>> picture?
>>
> 
> I haven't given much thought to Katta integration yet. But basically, I am
> thinking of indexing newly-crawled documents as Lucene shards and
> uploading them to Katta for searching. This should be quite possible with
> the new indexing system. But so far I have neither studied Katta in depth
> nor given much thought to integration, so I may be missing obvious stuff.

Me too..
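
Just to make the first half of the idea concrete, building one Lucene shard 
per batch of pages could look roughly like the sketch below (plain Lucene 
API, made-up field names; the Katta deployment step is exactly the part 
neither of us has looked at yet):

  // Rough sketch only: build one Lucene index ("shard") from a batch of
  // pages. Field names are made up; the Katta upload step is left out.
  import java.io.File;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.FSDirectory;

  public class ShardBuilder {
    /** pages[i][0] = url, pages[i][1] = extracted text */
    public static void buildShard(File shardDir, String[][] pages)
        throws Exception {
      IndexWriter writer = new IndexWriter(FSDirectory.open(shardDir),
          new StandardAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
      for (String[] page : pages) {
        Document doc = new Document();
        doc.add(new Field("url", page[0],
            Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", page[1],
            Field.Store.NO, Field.Index.ANALYZED));
        writer.addDocument(doc);
      }
      writer.optimize();   // one compact shard, ready to be shipped somewhere
      writer.close();
    }
  }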

> About distributed Solr: I would very much like to do this, and again I
> think it should be possible to do within Nutch. However, distributed Solr
> is ultimately uninteresting to me because (AFAIK) it doesn't have the
> reliability and high availability that Hadoop & HBase have, i.e. if a
> machine dies you lose that part of the index.

Grant Ingersoll is doing some initial work on integrating distributed Solr 
with ZooKeeper; once that is in usable shape, I think it will be more or 
less equivalent to Katta. I have a patch in my queue that adds direct 
Hadoop -> Solr indexing using a Hadoop OutputFormat. So there will be 
several options for pushing index updates to distributed indexes. We just 
need to offer the right API for the integration, and the current API is 
IMHO quite close.
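
Not the actual patch, but the general shape of such an OutputFormat, using 
the old mapred API plus SolrJ, would be roughly this (the "solr.server.url" 
property and the field names are made up for the sketch):

  // Sketch only: the reducer emits (url, content) pairs and this
  // RecordWriter pushes them to a Solr server, committing once per task.
  import java.io.IOException;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.OutputFormat;
  import org.apache.hadoop.mapred.RecordWriter;
  import org.apache.hadoop.mapred.Reporter;
  import org.apache.hadoop.util.Progressable;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class SolrOutputFormat implements OutputFormat<Text, Text> {

    public RecordWriter<Text, Text> getRecordWriter(FileSystem ignored,
        JobConf job, String name, Progressable progress) throws IOException {
      final SolrServer solr =
          new CommonsHttpSolrServer(job.get("solr.server.url"));
      return new RecordWriter<Text, Text>() {
        public void write(Text url, Text content) throws IOException {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("url", url.toString());
          doc.addField("content", content.toString());
          try {
            solr.add(doc);               // buffering/batching omitted
          } catch (Exception e) {
            throw new IOException(e.toString());
          }
        }
        public void close(Reporter reporter) throws IOException {
          try {
            solr.commit();               // make the updates visible
          } catch (Exception e) {
            throw new IOException(e.toString());
          }
        }
      };
    }

    public void checkOutputSpecs(FileSystem ignored, JobConf job) {
      // nothing to verify in this sketch
    }
  }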

> 
> Are there any ongoing projects that are live indexing systems like Solr,
> yet are backed by Hadoop HDFS like Katta?

There is the Bailey.sf.net project that fits this description, but it's 
dormant - either it was too early, or there were just too many design 
questions (or the committers simply moved on to other things).


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

