nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
Date Thu, 09 Jan 2014 15:34:53 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------

    Attachment: NUTCH-1113-junit.patch

Alright, manual testing did not go very well: it takes hours and is too cumbersome, so I
cooked up a unit test for these issues. The patch also includes a failed attempt to make
SegmentMerger implement Tool, as well as commented-out versions (single lines each) of
current trunk, NUTCH-1616 and NUTCH-1113.

There are two unit tests, each based on a randomized set of segments containing a record with
a random status. testRandomTestSequence() fails on current trunk but NOT with NUTCH-1113.
testRandomTestSequenceWithRedirects() always fails! The latter injects redirects into the set
of random records; this is the issue we must fix somehow.

There may be a problem with how I inject those redirects, but I think I got it right. If
someone here is able and willing to help out, I'd be really happy. This issue has haunted Nutch
from the beginning and must be dealt with, preferably before we release 1.8!

Thanks,
Markus

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene indexing code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the
> index vs. when I crawl without merging the segments.  Somehow the segment merger causes me
> to lose ~20% of my crawl database!
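
The steps quoted above can be sketched as one script, which may help when trying to reproduce the report. This is only a sketch under assumptions: the variables NUTCH and ROUNDS and the functions latest_segment and crawl are illustrative names, not part of Nutch or of the original report. The commands are copied as the reporter wrote them and have not been checked against a particular Nutch release. In particular, the stock 1.x bin/nutch script is believed to name the crawldb update step "updatedb", and the "index" step depends on the reporter's forward-ported Lucene indexer. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of the crawl cycle from the NUTCH-1113 report.
# NUTCH, ROUNDS, latest_segment and crawl are illustrative names (assumptions).

# Dry run by default: commands are printed, not executed.
# Set NUTCH=nutch (or a path to bin/nutch) to run them for real.
NUTCH="${NUTCH:-echo nutch}"
ROUNDS="${ROUNDS:-3}"

# Print the last segment directory in sorted order, as the report does
# with `ls -d segments/* | tail -1`.
latest_segment() {
    ls -d segments/* | tail -1
}

crawl() {
    $NUTCH inject crawldb/ url.txt || return 1

    i=0
    while [ "$i" -lt "$ROUNDS" ]; do
        $NUTCH generate crawldb/ segments/ -normalize || return 1
        seg=$(latest_segment)
        [ -n "$seg" ] || return 1          # no segment found: stop
        $NUTCH fetch "$seg" || return 1
        $NUTCH parse "$seg" || return 1
        # As written in the report; stock 1.x scripts may call this "updatedb".
        $NUTCH update crawldb "$seg" || return 1
        i=$((i + 1))
    done

    # The merge step the reporter suspects of dropping URLs.
    $NUTCH mergesegs merged/ -dir segments/ || return 1
    $NUTCH invertlinks linkdb/ -dir merged/ || return 1
    $NUTCH index index/ crawldb/ linkdb/ -dir merged/
}
```

After sourcing the script, calling crawl from a directory that contains segments/ prints the full command sequence. To compare against the unmerged case described in the report, the last three commands would take "-dir segments/" in place of "-dir merged/" and the mergesegs step would be skipped.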



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
