nutch-dev mailing list archives

From Markus Jelsma <markus.jel...@openindex.io>
Subject Re: Page deletion and tracking change between crawlings
Date Fri, 02 Sep 2011 13:25:18 GMT


On Friday 02 September 2011 15:06:59 Julio Garcés Teuber wrote:
> Hi Lewis!
> 
> Sorry for the delay in coming back to you but I was busy attacking other
> fronts. Now I'm back in full with Nutch integration. Summarizing your tips
> we have the following:
> 
> - In order to check which pages have changed, I can use the adaptive
> fetching interval as a reference. I can find more on this subject in
> nutch-default.xml and on the Nutch discussion lists.
> - Another way to track changes would be to make dumps before and after
> crawling.
> - Finally, to find out which pages have been deleted, you recommend
> checking the log.
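
For the before/after dump idea, a rough sketch of the comparison in Python. It assumes each record in the text dump starts with a line of the form "<url><TAB>Version: ..." and carries a "Signature: ..." line; that layout is an assumption about the dump format and should be checked against a real dump. The function names and the sample data are made up for illustration.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def signatures(lines: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each URL in a text dump to its content signature (or None)."""
    sigs: Dict[str, Optional[str]] = {}
    url = None
    for line in lines:
        line = line.rstrip("\n")
        if "\tVersion:" in line:
            # Assumed record header: "<url>\tVersion: <n>"
            url = line.split("\t", 1)[0]
            sigs[url] = None
        elif line.startswith("Signature:") and url is not None:
            value = line.split(":", 1)[1].strip()
            sigs[url] = None if value in ("", "null") else value
    return sigs


def diff_dumps(before: Iterable[str], after: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (added, changed) URL sets between two dumps."""
    old, new = signatures(before), signatures(after)
    added = set(new) - set(old)
    # "changed" = present in both dumps, with a new, different signature
    changed = {u for u in set(new) & set(old)
               if new[u] is not None and new[u] != old[u]}
    return added, changed


# Illustrative data only, not real crawl output.
BEFORE = (
    "http://example.com/a\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: aaa\n\n"
    "http://example.com/b\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: bbb\n"
)
AFTER = (
    "http://example.com/a\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: a2\n\n"
    "http://example.com/b\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: bbb\n\n"
    "http://example.com/c\tVersion: 7\nStatus: 2 (db_fetched)\nSignature: ccc\n"
)

added, changed = diff_dumps(BEFORE.splitlines(), AFTER.splitlines())
```

Comparing signatures rather than fetch times means a page that was refetched but not modified does not show up as changed.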

The easiest method is to run readdb -dump on the crawldb and grep the
output for db_gone.
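
If grep is not enough, the same filter can be sketched in Python. This assumes each record in the text dump has a header line "<url><TAB>Version: ..." followed by a "Status: <n> (<name>)" line; that layout is an assumption and worth checking against your own dump. The function names and the sample records are made up for illustration.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumed status line, e.g. "Status: 3 (db_gone)"
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def iter_status(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (url, status_name) for each record in a text dump."""
    url = None
    for line in lines:
        line = line.rstrip("\n")
        if "\tVersion:" in line:
            # Assumed record header: "<url>\tVersion: <n>"
            url = line.split("\t", 1)[0]
            continue
        match = STATUS_RE.match(line)
        if match and url is not None:
            yield url, match.group(1)
            url = None


def gone_urls(lines: Iterable[str]) -> List[str]:
    """Return the URLs whose status is db_gone, in dump order."""
    return [url for url, status in iter_status(lines) if status == "db_gone"]


# Illustrative data only, not real crawl output.
SAMPLE = (
    "http://example.com/a\tVersion: 7\n"
    "Status: 2 (db_fetched)\n"
    "Fetch time: Fri Sep 02 13:00:00 GMT 2011\n"
    "\n"
    "http://example.com/b\tVersion: 7\n"
    "Status: 3 (db_gone)\n"
    "Fetch time: Fri Sep 02 13:00:00 GMT 2011\n"
)

gone = gone_urls(SAMPLE.splitlines())
```

On a real dump you would pass an open file object instead of SAMPLE.splitlines(), since both functions only need an iterable of lines.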

> May I ask which log? Also do you think the log has a detailed list
> of deleted pages or just the total count

readdb -stats shows only the total number of 404s (db_gone), not the
individual URLs.

> ? Will this also remove the indexes
> for deleted pages on Nutch?

The solrclean tool will do that for you: it deletes the documents for
db_gone entries from the Solr index.

> 
> Thank you once again for your help I will do my homework with the first two
> bullets and will highly appreciate more info on the third.
> 
> Cheers!
> Julio.
> 
> On Wed, Jul 27, 2011 at 5:34 PM, lewis john mcgibbney
> <lewis.mcgibbney@gmail.com> wrote:
> > Hi Julio,
> > 
> > The algorithm you are referring to is called the adaptive fetching
> > interval in Nutch. There is some basic reading on this in
> > nutch-default.xml, and there should also be a good deal on the user@
> > list (as well as dev@). If you require more information on this then
> > please say, but I'm sure you will be able to suss it out.
> > 
> > The information on your post processing is quite vague and doesn't give
> > much indication of exactly what data we need in order to streamline your
> > post processing activity. For clarity, is it possible to expand upon
> > what you provided?
> > 
> > I know this sounds really basic, but you could do a dump of your crawldb
> > before and after for comparison or similarity analysis. This way we would
> > find the status of URLs in crawldb as well as when they were last fetched
> > and whether or not they have been updated since the last crawl.
> > 
> > Solr clean will remove the various pages you mention, so that your index
> > accurately reflects the web graph. However, again, I am not entirely
> > sure if we have a method for determining exactly which pages were
> > removed, although we do get log output telling us how many pages were
> > removed.
> > 
> > On Wed, Jul 27, 2011 at 3:44 PM, Julio Garcés Teuber
> > <julio@xinergia.com> wrote:
> >> I have configured a *Nutch* instance to continuously crawl a particular
> >> site. I have successfully managed to get the website data in bulk and,
> >> based on that, to post process that information.
> >> 
> >> The problem I'm facing now is that every time I run the crawling process
> >> I have to post process the whole site. I want to optimize the post
> >> processing and in order to do so I need to get from *Nutch* the list of
> >> pages that have changed since the last crawling process was run.
> >> 
> >> What's the best way to do that? Is there already a mechanism in
> >> *Nutch* that keeps track of the last time the content of a page has
> >> changed? Do I have to create a registry of crawled pages with an md5,
> >> for example, and keep track of changes myself? Aside from tracking
> >> which pages have changed, I also need to track which pages have been
> >> removed since the last crawl. Is there a specific mechanism to track
> >> removed pages (i.e. 404, 301, 302 HTTP codes)?
> >> 
> >> Any tips, ideas or sharing of experiences will be more than welcome and
> >> I will gladly share the lessons learnt once I have the thing running.
> >> 
> >> Hugs,
> >> Julio
> >> 
> >> --
> >> XNG | Julio Garcés Teuber
> >> Email: julio@xinergia.com
> >> Skype: julio.xng
> >> Tel: +54 (11) 4777.9488
> >> Fax: +1 (320) 514.4271
> >> http://www.xinergia.com/
> > 
> > --
> > *Lewis*

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
