nutch-dev mailing list archives

From Julio Garcés Teuber <ju...@xinergia.com>
Subject Re: Page deletion and tracking change between crawlings
Date Fri, 02 Sep 2011 13:06:59 GMT
Hi Lewis!

Sorry for the delay in coming back to you, but I was busy attacking other
fronts. Now I'm fully back on the Nutch integration. Summarizing your tips,
we have the following:

- In order to check which pages have changed, I can use the adaptive fetch
interval as a reference. I can find more on this subject in nutch-default.xml
and on the Nutch discussion lists (see the config sketch right after this list).
- Another way to track changes would be to dump the crawldb before and after
crawling and compare the two dumps (I've put a rough sketch of that further below).
- Finally, to find out which pages have been deleted, you recommend checking
the log. May I ask which log? Also, does the log contain a detailed list of
deleted pages or just the total count? Will this also remove the index entries
for the deleted pages in Nutch?
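
For the first bullet, here is what I think the relevant knobs are. This is only
a sketch of overrides for conf/nutch-site.xml, based on my reading of
nutch-default.xml (Nutch 1.x property names); please correct me if I picked the
wrong properties or values:

<!-- conf/nutch-site.xml: switch from the default fixed fetch schedule to the
     adaptive one; property names copied from nutch-default.xml, values are
     the defaults as far as I can tell -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
</property>
<property>
  <!-- shrink the interval when a page turns out to have changed -->
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
</property>
<property>
  <!-- grow the interval when a page turns out to be unchanged -->
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
</property>

If I understand it correctly, the per-page interval shrinks when the page has
changed and grows when it hasn't, so the fetch and modified times in the
crawldb become a rough "last changed" hint that I can read back out.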

Thank you once again for your help. I will do my homework on the first two
bullets and would highly appreciate more info on the third.
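
For the second bullet, here is a minimal sketch of how I plan to compare the
two dumps. I'm assuming plain-text output from "bin/nutch readdb <crawldb>
-dump <dir>", where each record starts with "<url><TAB>Version: ..." and
carries "Status:" and "Signature:" lines; the little script below is my own,
not anything shipped with Nutch, so the parsing may need adjusting to the real
format:

#!/usr/bin/env python
# diff_crawldb_dumps.py -- compare two plain-text crawldb dumps taken before
# and after a crawl cycle. Assumes the readdb -dump text format described above.
import sys

def parse_dump(path):
    """Return {url: {'status': ..., 'signature': ...}} from one dump file."""
    records, url = {}, None
    with open(path) as f:
        for raw in f:
            line = raw.rstrip('\n')
            if '\tVersion:' in line:          # first line of a new record
                url = line.split('\t', 1)[0]
                records[url] = {'status': None, 'signature': None}
            elif url and line.startswith('Status:'):
                records[url]['status'] = line.split(':', 1)[1].strip()
            elif url and line.startswith('Signature:'):
                records[url]['signature'] = line.split(':', 1)[1].strip()
    return records

def main(before_path, after_path):
    before = parse_dump(before_path)
    after = parse_dump(after_path)
    # gone: URL disappeared from the crawldb or is now marked db_gone
    removed = [u for u in before
               if u not in after or 'db_gone' in (after[u]['status'] or '')]
    # changed: content signature differs between the two dumps
    changed = [u for u in after
               if u in before and before[u]['signature'] != after[u]['signature']]
    print('Removed or gone (%d):' % len(removed))
    for u in sorted(removed):
        print('  ' + u)
    print('Changed signature (%d):' % len(changed))
    for u in sorted(changed):
        print('  ' + u)

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])

If the dump directory contains the usual part-* files, I'd run it per part,
e.g. "python diff_crawldb_dumps.py dump_before/part-00000 dump_after/part-00000",
and feed the two lists into the incremental post-processing.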

Cheers!
Julio.

On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Julio,
>
> The algorithm you are referring to is called the adaptive fetching interval
> in Nutch. There is some basic reading on this in nutch-default.xml, and there
> should also be a good deal on the user@ list (as well as dev@). If you
> require more information on this then please say, but I'm sure you will be
> able to suss it out.
>
> The information on your post-processing is quite vague and doesn't give much
> indication of exactly what data is needed in order to streamline your
> post-processing activity. For clarity, is it possible to expand upon what you
> provided?
>
> I know this sounds really basic, but you could do a dump of your crawldb
> before and after each crawl for comparison or similarity analysis. This way
> you would get the status of the URLs in the crawldb, as well as when they
> were last fetched and whether or not they have been updated since the last crawl.
>
> Solr clean will remove the various pages you mention, as a method for
> keeping an accurate representation of the web graph in your index.
> However, I am not entirely sure whether we have a method for determining
> exactly which pages were removed; we do get log output telling us
> how many pages were removed.
>
> On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber <julio@xinergia.com> wrote:
>
>> I have configured a *Nutch* instance to continuously crawl a particular
>> site. I have successfully managed to get the website data in bulk and to
>> post-process that information.
>>
>> The problem I'm facing now is that every time I run the crawling process I
>> have to post-process the whole site. I want to optimize the post-processing,
>> and in order to do so I need to get from *Nutch* the list of pages that
>> have changed since the last crawl was run.
>>
>> What's the best way to do that? Is there already a mechanism in *Nutch* that
>> keeps track of the last time the content of a page changed? Do I have
>> to create a registry of crawled pages with an md5, for example, and keep
>> track of changes myself? Aside from tracking which pages have changed, I also
>> need to track which pages have been removed since the last crawl. Is there a
>> specific mechanism to track removed pages (e.g. 404, 301, 302 HTTP codes)?
>>
>> Any tips, ideas or sharing of experiences will be more than welcome and I
>> will gladly share the lessons learnt once I have the thing running.
>>
>> Hugs,
>> Julio
>>
>> --
>> XNG | Julio Garcés Teuber
>> Email: julio@xinergia.com
>> Skype: julio.xng
>> Tel: +54 (11) 4777.9488
>> Fax: +1 (320) 514.4271
>> http://www.xinergia.com/
>>
>
>
>
> --
> *Lewis*
>
>


-- 
XNG | Julio Garcés Teuber
Email: julio@xinergia.com
Skype: julio.xng
Tel: +54 (11) 4777.9488
Fax: +1 (320) 514.4271
http://www.xinergia.com/
