nutch-dev mailing list archives

From Stefan Groschupf
Subject Re: Problems with updatedb
Date Wed, 23 Mar 2005 14:39:04 GMT
OK, now I understand your problem, but I don't have a suggestion or 
solution for it. :-/
This is strange behavior, though.
This happens with and without merging segments before updating the 
Let's see if someone has an idea what the problem could be. If you get 
no response, I suggest you open a bug report.
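
Before filing a report, it might help to list every page where the stored outlink count and the retrievable link records disagree, as in the quoted message below. The following is only a rough sketch under assumptions: OutlinkCheck and findMismatches are hypothetical names, not part of Nutch, and the two maps stand in for values you would collect yourself from page.getNumOutlinks() and WebDBReader.getLinks(page.getMD5()).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Nutch: it compares the outlink count a
// page record claims with the number of link records actually retrievable.
class OutlinkCheck {

    // declaredCounts: url -> count reported by the page record
    //                 (assumed to come from page.getNumOutlinks()).
    // storedLinks:    url -> link targets that could be read back
    //                 (assumed to come from WebDBReader.getLinks(...)).
    // Returns the urls whose two numbers differ, sorted by url.
    static List<String> findMismatches(Map<String, Integer> declaredCounts,
                                       Map<String, List<String>> storedLinks) {
        List<String> mismatches = new ArrayList<>();
        // Copy into a TreeMap so the report order is stable.
        for (Map.Entry<String, Integer> e : new TreeMap<>(declaredCounts).entrySet()) {
            List<String> links = storedLinks.get(e.getKey());
            int stored = (links == null) ? 0 : links.size();
            if (stored != e.getValue()) {
                mismatches.add(e.getKey());
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        // Made-up sample data: one page claims 12 outlinks but none can be
        // read back; the other page is consistent.
        Map<String, Integer> declared = new HashMap<>();
        declared.put("http://a.example/", 12);
        declared.put("http://b.example/", 1);

        Map<String, List<String>> stored = new HashMap<>();
        stored.put("http://b.example/", Arrays.asList("http://c.example/"));

        System.out.println(findMismatches(declared, stored));
    }
}
```

Running such a check against both the per-run dbs and the combined db would show whether the links are missing for all pages or only for some of them, which is useful detail for a bug report.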

On 23.03.2005 at 15:30, Isabel Drost wrote:

> Stefan,
>> I may not clearly understand your problem, but creating an empty db
>> and updating it with all segments is the way to go.
> OK, then at least I did not miss a step. I will try to present my
> problem a bit more clearly:
> I crawled several web sites quite a while ago with the intranet
> crawling tool. What I ended up with was about 5 directories, each
> containing the segment, index and webdb for one of the runs. What I
> actually want is one big webdb that contains the link graph of all
> crawled web pages.
> AFAIK, if I want to have one common webdb, I first need to merge all
> segments, then create a new db and update it with the merged segment.
> Is that right so far? I have done this with the commands given in the
> last mail.
> Next I examined the links covered in this newly created db and
> compared them to the links present in each of the 5 single webdbs.
> What I noticed was that for many pages the outlinks could not be
> retrieved from the newly created db.
> In other words:
> I have extended the link analysis tool to fit my needs. If I use the
> webdb of, say, the first intranet run, for the page with url I get 12
> outlinks when calling WebDBReader.getLinks(page.getMD5()).
> If I use the db of the combined segments, I do not get any link
> objects for the page with this url using this method. Still, if I
> call page.getNumOutlinks(), I get the information that there should
> be 12 outlinks.
> I could not find any detailed documentation concerning the updatedb
> tool: Could it be that links older than some predefined threshold are
> simply omitted?
>> Greetings to Vogtland, isn't it. :)
> Not quite, but near. :)
> Cheers,
> Isabel
> -- 
> QOTD: All most men really want in life is a wife, a house, two kids 
> and a car,
> a cat, no maybe a dog.  Ummm, scratch one of the kids and add a dog.
> Definitely a dog.
>   |\      _,,,---,,_
>   /,`.-'`'    -.  ;-;;,_  More information about the
>  |,4-  ) )-,_..;\ (  `'-' sender of this mail available
> '---''(_/--'  `-'\_) (fL) at ;)
-----------information technology-------------------
