nutch-dev mailing list archives

From Doğacan Güney <doga...@gmail.com>
Subject Re: Nutch dev. plans
Date Fri, 17 Jul 2009 21:47:43 GMT
On Fri, Jul 17, 2009 at 21:32, Andrzej Bialecki<ab@getopt.org> wrote:
> Doğacan Güney wrote:
>>
>> Hey list,
>>
>> On Fri, Jul 17, 2009 at 16:55, Andrzej Bialecki<ab@getopt.org> wrote:
>>>
>>> Hi all,
>>>
>>> I think we should be creating a sandbox area, where we can collaborate
>>> on various subprojects, such as HBase, OSGI, Tika parsers, etc. Dogacan
>>> will be importing his HBase work as 'nutchbase'. Tika work is the least
>>> disruptive, so it could occur even on trunk. OSGI plugins work (which I'd
>>> like to tackle) means significant refactoring so I'd rather put this on a
>>> branch too.
>>>
>>
>> Thanks for starting the discussion, Andrzej.
>>
>> Can you detail your OSGI plugin framework design? Maybe I missed the
>> discussion but updating the plugin system has been something that I
>> wanted to do for a long time :) so I am very much interested in your
>> design.
>
> There's no specific design yet except I can't stand the existing plugin
> framework anymore ... ;) I started reading on OSGI and it seems that it
> supports the functionality that we need, and much more - it certainly looks
> like a better alternative than maintaining our plugin system beyond 1.x ...
>

Couldn't agree more with the "can't stand plugin framework" :D

Any good links on OSGI stuff?

> Oh, an additional comment about the scoring API: I don't think the claimed
> benefits of OPIC outweigh the widespread complications that it caused in the
> API. Besides, getting the static scoring right is very very tricky, so from
> the engineer's point of view IMHO it's better to do the computation offline,
> where you have more control over the process and can easily re-run the
> computation, rather than rely on an online unstable algorithm that modifies
> scores in place ...
>

Yeah, I am convinced :) . I am not done yet, but I think OPIC-like scoring will
feel very natural in an HBase-backed Nutch. Give me a couple more days to polish
the scoring API, then we can change it if you are not happy with it.

>
>>
>>> Dogacan, you mentioned that you would like to work on Katta integration.
>>> Could you shed some light on how this fits with the abstract indexing &
>>> searching layer that we now have, and how distributed Solr fits into this
>>> picture?
>>>
>>
>> I haven't yet given much thought to Katta integration. But basically,
>> I am thinking of indexing newly-crawled documents as Lucene shards and
>> uploading them to Katta for searching. This should be very possible
>> with the new indexing system. But so far, I have neither studied Katta
>> too much nor given much thought to integration. So I may be missing
>> obvious stuff.
>
> Me too..
>
>> About distributed Solr: I would very much like to do this and again, I
>> think, this should be possible to do within Nutch. However, distributed
>> Solr is ultimately uninteresting to me because (AFAIK) it doesn't have
>> the reliability and high-availability that Hadoop & HBase have, i.e. if
>> a machine dies you lose that part of the index.
>
> Grant Ingersoll is doing some initial work on integrating distributed Solr
> and ZooKeeper; once this is in a usable shape, I think perhaps it's more
> or less equivalent to Katta. I have a patch in my queue that adds direct
> Hadoop->Solr indexing, using Hadoop OutputFormat. So there will be many
> options to push index updates to distributed indexes. We just need to offer
> the right API to implement the integration, and the current API is IMHO
> quite close.
>
>>
>> Are there any projects going on that are live indexing systems like
>> Solr, yet are backed up by Hadoop HDFS like Katta?
>
> There is the Bailey.sf.net project that fits this description, but it's
> dormant - either it was too early, or there were just too many design
> questions (or simply the committers moved to other things).
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>



-- 
Doğacan Güney
