nutch-dev mailing list archives

From Stefan Groschupf ...@media-style.com>
Subject Re: Problems with updatedb
Date Wed, 23 Mar 2005 14:01:13 GMT
Isabel,
I may not have fully understood your problem, but creating an empty db
and updating it with all segments is the way to go.
Maybe you missed the db analysis step.
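
A rough sketch of that sequence, not a tested recipe: create an empty db,
update it once per segment, then run the analysis step. The paths and the
segment names s1 to s5 are taken from the mail quoted below. The run helper,
the DRY_RUN switch, the function name rebuild_webdb and the iteration count 5
are illustrative assumptions, and the arguments of "nutch analyze" are assumed
too, so please check them against the usage message of your checkout.

```sh
#!/bin/sh
# Sketch only: rebuild one webdb from several segments, then analyze it.
# WEBDB, SEGDIR and the segment names are placeholders from the quoted mail.
WEBDB="${WEBDB:-$HOME/webdb}"
SEGDIR="${SEGDIR:-$HOME/segmentdir}"

# Hypothetical helper: with DRY_RUN=1 a command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

rebuild_webdb() {
    # Start from an empty web db.
    run nutch admin "$WEBDB" -create || return 1
    # Feed every original segment into the db, one updatedb call per segment.
    for seg in s1 s2 s3 s4 s5; do
        run nutch updatedb "$WEBDB" "$SEGDIR/$seg" || return 1
    done
    # Analysis step. ASSUMPTION: "analyze <db> <iterations>" is the right form;
    # the value 5 is an arbitrary example.
    run nutch analyze "$WEBDB" 5
}
```

Setting DRY_RUN=1 before calling rebuild_webdb only prints the seven commands,
which is a cheap way to check the paths before touching a real db.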

HTH
Stefan
Greetings to the Vogtland, right? :)

On 23.03.2005, at 14:52, Isabel Drost wrote:

>
> Hello,
>
> before explaining my problem: Sorry if you get this mail twice - I already
> sent it this morning, but it seems not to have reached the list.
>
> I have a problem with the updatedb tool. I should mention first that at
> the moment I am using a CVS version from October last year, so it could well
> be that the problem has been fixed in the meantime. This morning I tried
> unsuccessfully to update the source code, so I could not check whether the
> problem still exists.
>
> I used the intranet crawl tool a while ago and got 5 different segments
> (together with an index and webdb for each one). At the moment I am trying to
> merge those segments and dbs into one to run a link analysis on the whole
> link graph. Yet I ran into problems:
>
> -------------------------------
>
> Let's assume I have segments with ids s1 to s5, all segment dirs copied into
> one common directory segmentdir. I merged them with the command
>
> nutch mergesegs ~/segmentdir -cm
>
> Then I created a new web db and tried to insert all links present in the
> merged segment s6 into the webdb:
>
> nutch admin ~/webdb -create
> nutch updatedb ~/webdb ~/segmentdir/s6
>
> When comparing the links in ~/webdb to those present in the individual webdbs
> created by the initial intranet crawl, I noticed that some of the outlink
> objects are missing.
>
> --------------------------------
>
> So I tried leaving out the merging step and used the original segments for
> the update:
>
> nutch admin ~/webdb -create
> nutch updatedb ~/webdb ~/segmentdir/s1 [...] ~/segmentdir/s5
>
> Still, some of the links in, say, the webdb of segment 5 were missing in the
> newly created webdb. Some of the web pages had (according to
> Page.getNumOutlinks()) 34 outlinks, but they could not be retrieved from the
> newly created db. Yet in the old db created during the intranet crawl they
> could be retrieved without problems.
>
> --------------------------------
>
> Last but not least I tried to create a webdb for one single segment, s5:
>
> nutch admin ~/webdb -create
> nutch updatedb ~/webdb ~/segmentdir/s5
>
> Yet again, I observed the same behaviour: in the webdb created during the
> intranet crawl for this segment, the outlinks indicated by
> page.getNumOutlinks() could be retrieved from the webdb with a call to
> the method reader.getLinks(page.getMd5()). But when using the webdb created
> with the two commands above, I did not get anything for at least some of
> these pages.
>
> --------------------------------
>
> I suppose that I simply missed something important when creating the dbs. But
> at the moment I cannot guess what I have missed. I would appreciate your help
> on this problem very much.
>
> By the way: Is there any way of merging those webdbs directly? I did not find
> any.
>
> --------------------------------
> --------------------------------
>
> The second question I have should be a bit simpler to answer: Yesterday I did
> a dump of one of the segments. I noticed that something like the size/length
> of a page seems to be stored, as well as some kind of content. Please correct
> me if I am wrong here. Is there any way of reaching this information
> from a specific Page object? It does not seem to be a property of
> this type of object.
>
> Sorry for the somewhat longish mail :)
>
> Cheers and have a nice day,
> Isabel
>
> -- 
> QOTD: A man in love is incomplete until he is married.  Then he is finished.
> -- Zsa Zsa Gabor, "Newsweek" 
>   |\      _,,,---,,_
>   /,`.-'`'    -.  ;-;;,_  More information about the
>  |,4-  ) )-,_..;\ (  `'-' sender of this mail available
> '---''(_/--'  `-'\_) (fL) at http://www.isabel-drost.de ;)
>
>
-----------information technology-------------------
company:     http://www.media-style.com
forum:           http://www.text-mining.org
blog:	             http://www.find23.net

