nutch-dev mailing list archives

From Cihad Guzel <>
Subject Re: GSOC2015 - Sitemap crawler roadmap problems
Date Sun, 02 Aug 2015 11:48:15 GMT

I am continuing my work. My code is now integrated into the Nutch life
cycle, and sitemap files can be injected and parsed. As you know, a sitemap
file can carry tags such as lastmod, priority, and changefreq. First, I put
the tag values into metadata. Then I update the last-modified and
fetch-interval fields of the WebPage according to those tags. However, I have
not used the priority tag yet. I want to calculate a new score using the
priority for the URLs that come from a sitemap. The problem is that while
sitemap URLs have a priority value, other webpage URLs do not, so the scores
would be inconsistent. How do you think this should be implemented?
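One possible way to handle the missing priority on non-sitemap URLs is to treat the sitemap protocol's default priority of 0.5 as the neutral value, so that pages without the tag keep their existing score. The sketch below is purely illustrative (the class and method names are made up, not Nutch scoring-filter APIs):

```java
// Sketch: fold an optional sitemap <priority> into an existing page score.
// Priorities range from 0.0 to 1.0; the sitemap protocol's default is 0.5,
// which maps to a multiplier of 1.0, leaving non-sitemap URLs unaffected.
public class SitemapScore {

    static final float DEFAULT_PRIORITY = 0.5f;

    /** priority may be null for URLs that did not come from a sitemap. */
    public static float adjust(float baseScore, Float priority) {
        float p = (priority == null) ? DEFAULT_PRIORITY : priority;
        // Map priority [0.0, 1.0] to a multiplier [0.5, 1.5] centered at 1.0.
        return baseScore * (0.5f + p);
    }

    public static void main(String[] args) {
        System.out.println(adjust(1.0f, null));   // non-sitemap URL: unchanged
        System.out.println(adjust(1.0f, 1.0f));   // top-priority sitemap URL: boosted
        System.out.println(adjust(1.0f, 0.0f));   // lowest-priority URL: reduced
    }
}
```

With this scheme, URLs that never had a priority tag behave exactly as before, which avoids the inconsistency described above.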

I have attached the latest code as a patch to this email.
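The tag extraction described above can be sketched with plain JDK XML parsing, independent of the Nutch plugin APIs. The element names follow the sitemap protocol; the class and helper names here are made up for illustration and are not part of the attached patch:

```java
// Sketch: extract per-URL metadata (lastmod, changefreq, priority) from a
// sitemap document, so each URL can later be recorded with its own metadata.
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class SitemapTags {

    /** Returns url -> (tag -> value) for every &lt;url&gt; entry. */
    public static Map<String, Map<String, String>> parse(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        NodeList urls = doc.getElementsByTagName("url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            Map<String, String> meta = new LinkedHashMap<>();
            // Copy only the optional per-URL tags that are actually present.
            for (String tag : new String[] {"lastmod", "changefreq", "priority"}) {
                NodeList n = url.getElementsByTagName(tag);
                if (n.getLength() > 0) {
                    meta.put(tag, n.item(0).getTextContent().trim());
                }
            }
            String loc = url.getElementsByTagName("loc").item(0).getTextContent().trim();
            result.put(loc, meta);
        }
        return result;
    }
}
```

A map like this per URL is one way to carry the tag values into per-page metadata before they are written back to the crawldb.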

2015-07-11 12:10 GMT+03:00 Cihad Guzel <>:

> Hi Lewis.
> Thanks for your suggestions. I will think them over.
> 2015-07-10 3:47 GMT+03:00 Lewis John Mcgibbney <>
> :
>> Hi Cihad,
>> I'll take a look tonight.
>> My understanding is that this would be implemented as part of core and
>> not as a plugin. Within a plugin we can, at times, have access to less
>> verbose data structures. This is of course not always the case, but
>> generally speaking we see more issues, depending on which interfaces we
>> extend, with appropriate access to the correct data structures. We then
>> also have the issue of dependency management.
>> I'll have a look through the various links you have sent and then write
>> back here in due course.
>> Apologies about the delay.
>> Thanks
>> On Mon, Jul 6, 2015 at 12:20 AM, Cihad Guzel <> wrote:
>>> Hi,
>>> I have found a patch for my metadata problem [1]. However, the problem
>>> isn't solved for 2.x [2], so I guess I need to solve it myself.
>>> [1]
>>> [2]
>>> 2015-07-04 15:56 GMT+03:00 Cihad Guzel <>:
>>>> Hi Lewis,
>>>> Talat and I talked about the architecture for sitemap support. We
>>>> thought the problem could be solved within the Nutch life cycle; we
>>>> don't want to build a separate life cycle for sitemap crawling.
>>>> So, I have the following problems:
>>>> If the sitemap file is very large, it cannot be fetched and parsed; it
>>>> times out. I worked around the timeout temporarily by raising the
>>>> timeout value in nutch-site.xml for parsing, and by fetching only small
>>>> files. That is not a good solution.
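[Editor's note: the workaround described above is usually configured through the protocol-level limits in nutch-site.xml; the values below are illustrative only.]

```xml
<!-- nutch-site.xml: raise fetch limits for large sitemap files -->
<property>
  <name>http.timeout</name>
  <value>30000</value>
  <description>HTTP timeout in milliseconds (default 10000).</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>10485760</value>
  <description>Maximum content length to download (default 65536 bytes);
  sitemap files can be far larger than an ordinary page.</description>
</property>
```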
>>>> Moreover, as you know, sitemap files have some special tags such as
>>>> "loc", "lastmod", "changefreq", and "priority". These are parsed by my
>>>> parse plugin. I want to record them in the crawldb, but the Parse
>>>> object doesn't support metadata or similar fields; it only has an
>>>> outlink array, which isn't enough for recording metadata.
>>>> I want to record each URL in the sitemap file together with its own
>>>> metadata.
>>>> I reviewed all the patches and comments on NUTCH-1465, and it contains
>>>> some solutions for the same problems, but they introduce a separate job
>>>> for sitemap crawling.
>>>> Could you show me a way out?
>>>> Thanks.
>> --
>> *Lewis*
