nutch-dev mailing list archives

From Doğacan Güney <doga...@gmail.com>
Subject Re: Nutch 2.0 roadmap
Date Thu, 08 Apr 2010 20:20:24 GMT
On Thu, Apr 8, 2010 at 21:11, MilleBii <millebii@gmail.com> wrote:
> Not sure what you mean by pig script, but I'd like to be able to make a
> multi-criteria selection of URLs for fetching...

I mean a query language like Pig:

http://hadoop.apache.org/pig/

If we expose the data correctly, then you should be able to generate on any
criteria that you want.
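
To make the idea concrete, here is a minimal sketch (in Python rather than Pig, purely for illustration) of generation driven by several criteria at once: a per-label refetch interval, never-fetched URLs at the frontier, and score as a tie-breaker. Everything in it is hypothetical: UrlRow, refetch_interval and generate are made-up names, not Nutch or nutchbase APIs, and the fields are only an assumption about what the exposed data could look like.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class UrlRow:
    """Hypothetical per-URL record; NOT the actual nutchbase schema."""
    url: str
    last_fetch: Optional[datetime] = None  # None means never fetched (frontier)
    labels: frozenset = frozenset()        # user-supplied labels, e.g. {"topic-a"}
    score: float = 0.0


def refetch_interval(row, intervals, default):
    """Return the shortest interval among the row's known labels, else the default."""
    matching = [intervals[label] for label in row.labels if label in intervals]
    return min(matching) if matching else default


def generate(rows, now, intervals, default, limit):
    """Select up to `limit` URLs that are due for fetching.

    A URL is due if it was never fetched, or if its refetch interval has elapsed.
    Ordering: labeled URLs first, then never-fetched URLs, then higher scores.
    """
    due = [
        row for row in rows
        if row.last_fetch is None
        or now - row.last_fetch >= refetch_interval(row, intervals, default)
    ]
    due.sort(key=lambda row: (
        not any(label in intervals for label in row.labels),  # labeled first
        row.last_fetch is not None,                           # frontier next
        -row.score,                                           # then by score
    ))
    return [row.url for row in due[:limit]]
```

With something like this, "weekly for topic A, monthly for everything else" becomes a one-entry intervals mapping plus a default, and changing the selection policy means changing only the filter and the sort key.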

> The scoring method forces a kind of one-dimensional approach
> which is not really easy to deal with.
>
> The regex filters are good, but they assume you want to select URLs on data
> which is in the URL... Pretty limited in fact.
>
> I basically would like to do 'content'-based crawling. Say, for
> example, that I'm interested in "topic A".
> I'd like to label URLs that match "topic A" (user-supplied logic).
> Later on I would want to crawl "topic A" URLs at a certain frequency
> and non-labeled URLs for exploring in a different way.
>
> This looks hard to do right now.
>
> 2010/4/8, Doğacan Güney <dogacan@gmail.com>:
>> Hi,
>>
>> On Wed, Apr 7, 2010 at 21:19, MilleBii <millebii@gmail.com> wrote:
>>> Just a question:
>>> Will the new HBase implementation allow more sophisticated crawling
>>> strategies than the current score-based one?
>>>
>>> To give you a few examples of what I'd like to do:
>>> Define different crawling frequencies for different sets of URLs, say
>>> weekly for some URLs, monthly or more for others.
>>>
>>> Select URLs to re-crawl based on attributes previously extracted. Just
>>> one example: recrawl URLs that contained a certain keyword (or set of keywords).
>>>
>>> Select URLs that have not yet been crawled, i.e. those at the frontier of
>>> the crawl.
>>>
>>
>> At some point, it would be nice to change generator so that it is only a
>> handful of methods and a pig (or something else) script. So, we would provide
>> most of the functions you may need during generation (accessing various data)
>> but actual generation would be a pig process. This way, anyone can easily
>> change generate any way they want (even make it more jobs than 2 if they
>> want more complex schemes).
>>
>>>
>>>
>>>
>>> 2010/4/7, Doğacan Güney <dogacan@gmail.com>:
>>>> Hey everyone,
>>>>
>>>> On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <ab@getopt.org> wrote:
>>>>> On 2010-04-06 15:43, Julien Nioche wrote:
>>>>>> Hi guys,
>>>>>>
>>>>>> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
>>>>>> based on what is currently referred to as NutchBase. Shall we create a
>>>>>> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
>>>>>> JIRA so that we can file issues / feature requests on 2.0? Do you think that
>>>>>> the current NutchBase could be used as a basis for the 2.0 branch?
>>>>>
>>>>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>>>>> fixes and changes in trunk since it's been last touched ...
>>>>>
>>>>
>>>> I know... But I still intend to finish it, I just need to schedule
>>>> some time for it.
>>>>
>>>> My vote would be to go with nutchbase.
>>>>
>>>>>>
>>>>>> Talking about features, what else would we add apart from :
>>>>>>
>>>>>> * support for HBase : via ORM or not (see NUTCH-808
>>>>>> <https://issues.apache.org/jira/browse/NUTCH-808>)
>>>>>
>>>>> This IMHO is promising, this could open the doors to small-to-medium
>>>>> installations that are currently too cumbersome to handle.
>>>>>
>>>>
>>>> Yeah, there is already a simple ORM within nutchbase that is avro-based and
>>>> should be generic enough to also support MySQL, cassandra and berkeleydb. But
>>>> any good ORM will be a very good addition.
>>>>
>>>>>> * plugin cleanup : Tika only for parsing - get rid of everything else?
>>>>>
>>>>> Basically, yes - keep only stuff like HtmlParseFilters (probably with a
>>>>> different API) so that we can post-process the DOM created in Tika from
>>>>> whatever original format.
>>>>>
>>>>> Also, the goal of the crawler-commons project is to provide APIs and
>>>>> implementations of stuff that is needed for every open source crawler
>>>>> project, like: robots handling, url filtering and url normalization, URL
>>>>> state management, perhaps deduplication. We should coordinate our
>>>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>>>> droids) may contribute to this shared pool of functionality, much like
>>>>> Tika does for the common need of parsing complex formats.
>>>>>
>>>>>> * remove index / search and delegate to SOLR
>>>>>
>>>>> +1 - we may still keep a thin abstract layer to allow other
>>>>> indexing/search backends, but the current mess of indexing/query filters
>>>>> and competing indexing frameworks (lucene, fields, solr) should go away.
>>>>> We should go directly from DOM to a NutchDocument, and stop there.
>>>>>
>>>>
>>>> Agreed. I would like to add support for katta and other indexing backends
>>>> at some point but NutchDocument should be our canonical representation.
>>>> The rest should be up to indexing backends.
>>>>
>>>>> Regarding search - currently the search API is too low-level, with the
>>>>> custom text and query analysis chains. This needlessly introduces the
>>>>> (in)famous Nutch Query classes and Nutch query syntax limitations. We
>>>>> should get rid of it and simply leave this part of the processing to the
>>>>> search backend. Probably we will use the SolrCloud branch that supports
>>>>> sharding and global IDF.
>>>>>
>>>>>> * new functionalities e.g. sitemap support, canonical tag etc...
>>>>>
>>>>> Plus a better handling of redirects, detecting duplicated sites,
>>>>> detection of spam cliques, tools to manage the webgraph, etc.
>>>>>
>>>>>>
>>>>>> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
>>>>>> update?
>>>>>
>>>>> Definitely. :)
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Doğacan Güney
>>>>
>>>
>>>
>>> --
>>> -MilleBii-
>>>
>>
>>
>>
>> --
>> Doğacan Güney
>>
>
>
> --
> -MilleBii-
>



-- 
Doğacan Güney
