nutch-dev mailing list archives

From Doğacan Güney <doga...@gmail.com>
Subject Re: Nutch 2.0 roadmap
Date Thu, 08 Apr 2010 07:44:21 GMT
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii <millebii@gmail.com> wrote:
> Just a question:
> Will the new HBase implementation allow more sophisticated crawling
> strategies than the current score-based one?
>
> To give you a few examples of what I'd like to do:
> Define different crawling frequencies for different sets of URLs, say
> weekly for some URLs, monthly or more for others.
>
> Select URLs to re-crawl based on attributes previously extracted. Just
> one example: re-crawl URLs that contained a certain keyword (or set of
> keywords).
>
> Select URLs that have not yet been crawled, and are therefore at the
> frontier of the crawl.
>

At some point, it would be nice to change the generator so that it is only a
handful of methods plus a Pig (or something else) script. We would provide most
of the functions you may need during generation (for accessing various data),
but the actual generation would be a Pig process. This way, anyone can easily
change generation any way they want (even run it as more than 2 jobs if they
want more complex schemes).
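
To make the idea a bit more concrete, here is a rough sketch of what "a handful
of functions plus a user script" could look like. It is plain Python rather than
Pig, and every name in it (Page, is_due, is_unfetched, has_keyword, generate) is
made up for illustration; none of this is existing Nutch or nutchbase API. The
three selectors correspond to the three examples asked about above.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional, Set

DAY = 24 * 60 * 60  # seconds


@dataclass
class Page:
    """Hypothetical per-URL row, standing in for whatever the store exposes."""
    url: str
    last_fetch: Optional[int] = None          # epoch seconds; None = never fetched
    interval: int = 30 * DAY                  # per-URL re-crawl interval
    keywords: Set[str] = field(default_factory=set)  # extracted at parse time


# A selector decides whether one page should go into the next fetch list.
Selector = Callable[[Page, int], bool]


def is_due(page: Page, now: int) -> bool:
    """Already fetched, and its own interval (weekly, monthly, ...) has elapsed."""
    return page.last_fetch is not None and now - page.last_fetch >= page.interval


def is_unfetched(page: Page, now: int) -> bool:
    """Known but never fetched, i.e. on the frontier of the crawl."""
    return page.last_fetch is None


def has_keyword(*wanted: str) -> Selector:
    """Build a selector matching pages that contained any of the given keywords."""
    wanted_set = set(wanted)

    def selector(page: Page, now: int) -> bool:
        return bool(page.keywords & wanted_set)

    return selector


def generate(pages: Iterable[Page], selectors: List[Selector],
             now: int, limit: int) -> List[str]:
    """Return up to `limit` URLs accepted by at least one selector, in input order."""
    picked = [p.url for p in pages if any(s(p, now) for s in selectors)]
    return picked[:limit]
```

The "user script" part is then just the choice of selectors, for example
generate(pages, [is_due, has_keyword("hbase")], now, 1000). A real version
would run as one or more jobs over the table instead of a list in memory.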

>
>
>
> 2010/4/7, Doğacan Güney <dogacan@gmail.com>:
>> Hey everyone,
>>
>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <ab@getopt.org> wrote:
>>> On 2010-04-06 15:43, Julien Nioche wrote:
>>>> Hi guys,
>>>>
>>>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>>>> based on what is currently referred to as NutchBase. Shall we create a
>>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly
>>>> for JIRA so that we can file issues / feature requests on 2.0? Do you
>>>> think that the current NutchBase could be used as a basis for the 2.0
>>>> branch?
>>>
>>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>>> fixes and changes in trunk since it's been last touched ...
>>>
>>
>> I know... But I still intend to finish it, I just need to schedule
>> some time for it.
>>
>> My vote would be to go with nutchbase.
>>
>>>>
>>>> Talking about features, what else would we add apart from :
>>>>
>>>> * support for HBase : via ORM or not (see
>>>> NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
>>>
>>> This IMHO is promising, this could open the doors to small-to-medium
>>> installations that are currently too cumbersome to handle.
>>>
>>
>> Yeah, there is already a simple ORM within nutchbase that is
>> Avro-based and should be generic enough to also support MySQL,
>> Cassandra and Berkeley DB. But any good ORM will be a very good
>> addition.
>>
>>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>>
>>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>>> different API) so that we can post-process the DOM created in Tika from
>>> whatever original format.
>>>
>>> Also, the goal of the crawler-commons project is to provide APIs and
>>> implementations of stuff that is needed for every open source crawler
>>> project, like: robots handling, url filtering and url normalization, URL
>>> state management, perhaps deduplication. We should coordinate our
>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>> droids) may contribute to this shared pool of functionality, much like
>>> Tika does for the common need of parsing complex formats.
>>>
>>>> * remove index / search and delegate to SOLR
>>>
>>> +1 - we may still keep a thin abstract layer to allow other
>>> indexing/search backends, but the current mess of indexing/query filters
>>> and competing indexing frameworks (lucene, fields, solr) should go away.
>>> We should go directly from DOM to a NutchDocument, and stop there.
>>>
>>
>> Agreed. I would like to add support for katta and other indexing
>> backends at some point, but NutchDocument should be our canonical
>> representation. The rest should be up to indexing backends.
>>
>>> Regarding search - currently the search API is too low-level, with the
>>> custom text and query analysis chains. This needlessly introduces the
>>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>>> should get rid of it and simply leave this part of the processing to the
>>> search backend. Probably we will use the SolrCloud branch that supports
>>> sharding and global IDF.
>>>
>>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>>
>>> Plus a better handling of redirects, detecting duplicated sites,
>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>
>>>>
>>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>>> update?
>>>
>>> Definitely. :)
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>
> --
> -MilleBii-
>



-- 
Doğacan Güney
