nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
Date Wed, 22 Jan 2014 09:24:20 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1113:
---------------------------------

    Attachment: NUTCH-1113-junit.patch

Attached patch seems to completely fix the issue, finally!
* does not merge LINKED status
* does not merge fetch_retry status
* considers latest fetch datum
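
The three rules above can be illustrated with a small toy model. This is only a sketch of one reading of the bullets, not the code in the patch: the `Datum` record, the status strings and the `merge` function are hypothetical names chosen for illustration, and real segment data carries more fields than shown here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Hypothetical status names for this sketch; they are not Nutch's constants.
SKIPPED = {"linked", "fetch_retry"}


@dataclass(frozen=True)
class Datum:
    url: str
    status: str
    fetch_time: int  # e.g. epoch seconds


def merge(data: Iterable[Datum]) -> Dict[str, Datum]:
    """Keep, per URL, the datum with the latest fetch time.

    Data whose status is in SKIPPED never take part in the merge.
    """
    latest: Dict[str, Datum] = {}
    for d in data:
        if d.status in SKIPPED:
            continue  # rules 1 and 2: ignore linked and fetch_retry data
        current = latest.get(d.url)
        if current is None or d.fetch_time > current.fetch_time:
            latest[d.url] = d  # rule 3: the latest fetch datum wins
    return latest
```

In this model a later `fetch_retry` entry cannot displace an earlier successful fetch of the same URL, which is the kind of loss the issue describes.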

Can anyone here confirm the result? To do so you need a lot of segments, enough that together
they contain a good number of URLs that have been refetched in the meantime. Index those
segments in chronological order, segment by segment (do not pass them all to the indexer
via -dir; that is still a bug). Then merge the segments with this patch applied and index
the merged segment.

The number of indexed documents should be the same in both cases.
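
One way to make that comparison concrete is to compare the URL sets rather than only the counts. The sketch below assumes the URLs of each index have already been exported to a plain-text file with one URL per line; that export step, the file layout and the function names are assumptions for illustration, not something Nutch provides in this form.

```python
from pathlib import Path
from typing import Set, Tuple


def load_urls(path: str) -> Set[str]:
    """Read one URL per line, ignoring blank lines."""
    lines = Path(path).read_text().splitlines()
    return {line.strip() for line in lines if line.strip()}


def compare(unmerged_path: str, merged_path: str) -> Tuple[Set[str], Set[str]]:
    """Return (URLs only in the unmerged index, URLs only in the merged index)."""
    unmerged = load_urls(unmerged_path)
    merged = load_urls(merged_path)
    return unmerged - merged, merged - unmerged
```

If the patch behaves as intended, both returned sets should be empty; any URL in the first set is one that vanished during merging.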

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Priority: Blocker
>             Fix For: 1.9
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
>                      NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
>                      NUTCH-1113-trunk.patch, merged_segment_output.txt, unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
> nutch generate crawldb/ segments/ -normalize
> nutch fetch `ls -d segments/* | tail -1`
> nutch parse `ls -d segments/* | tail -1`
> nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward ported the lucene indexing
> code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the
> index vs. when I crawl without merging the segments.  Somehow the segment merger causes me
> to lose ~20% of my crawl database!



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
