nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1113) Merging segments causes URLs to vanish from crawldb/index?
Date Thu, 06 Mar 2014 21:59:50 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1113:
-----------------------------------

    Attachment: NUTCH-1113-trunk-junit-fail.patch

Also fixed a second problem in the JUnit test: any segment other than the first one may be empty at random.
We must ensure that at least one CrawlDatum (linked or fetch) is in each segment.
With this patch the JUnit tests now pass.

> Merging segments causes URLs to vanish from crawldb/index?
> ----------------------------------------------------------
>
>                 Key: NUTCH-1113
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1113
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Edward Drapkin
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
>                      NUTCH-1113-junit.patch, NUTCH-1113-junit.patch, NUTCH-1113-junit.patch,
>                      NUTCH-1113-trunk-junit-fail.patch, NUTCH-1113-trunk-junit-final.patch,
>                      NUTCH-1113-trunk.patch, NUTCH-1113-trunk.patch, merged_segment_output.txt,
>                      unmerged_segment_output.txt
>
>
> When I run Nutch, I use the following steps:
> nutch inject crawldb/ url.txt
> repeated 3 times:
>     nutch generate crawldb/ segments/ -normalize
>     nutch fetch `ls -d segments/* | tail -1`
>     nutch parse `ls -d segments/* | tail -1`
>     nutch update crawldb `ls -d segments/* | tail -1`
> nutch mergesegs merged/ -dir segments/
> nutch invertlinks linkdb/ -dir merged/
> nutch index index/ crawldb/ linkdb/ -dir merged/ (I forward-ported the Lucene indexing
> code from Nutch 1.1).
> When I crawl with merging segments, I lose about 20% of the URLs that wind up in the
> index vs. when I crawl without merging the segments.  Somehow the segment merger causes me
> to lose ~20% of my crawl database!
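
For anyone trying to reproduce the report, the steps quoted above can be sketched as a small shell script. This is only a sketch under assumptions: the subcommand names and arguments are copied as written in the report and not checked against any particular Nutch release, it assumes the generate/fetch/parse/update steps are the ones repeated three times, and the names `run_nutch`, `latest_segment`, `crawl_and_merge`, the `NUTCH` variable, and the `segments/LATEST` placeholder are hypothetical helpers introduced here. By default it is a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of the workflow quoted in NUTCH-1113. Commands are copied from the
# report as written; helper names and the NUTCH variable are hypothetical.

# Run one nutch subcommand. With NUTCH unset this is a dry run that only
# prints the command line; set NUTCH=/path/to/bin/nutch to execute for real.
run_nutch() {
  ${NUTCH:-echo nutch} "$@"
}

# Print the newest segment directory, as in the report's `ls | tail -1`.
# In a dry run no segment exists yet, so a placeholder name is printed.
latest_segment() {
  seg=$(ls -d segments/* 2>/dev/null | tail -1)
  if [ -n "$seg" ]; then
    printf '%s\n' "$seg"
  else
    printf '%s\n' "segments/LATEST"   # hypothetical placeholder
  fi
}

# Inject once, run three generate/fetch/parse/update rounds, then merge the
# segments and build the link database and index from the merged segment.
crawl_and_merge() {
  run_nutch inject crawldb/ url.txt
  for round in 1 2 3; do
    run_nutch generate crawldb/ segments/ -normalize
    run_nutch fetch "$(latest_segment)"
    run_nutch parse "$(latest_segment)"
    run_nutch update crawldb "$(latest_segment)"
  done
  run_nutch mergesegs merged/ -dir segments/
  run_nutch invertlinks linkdb/ -dir merged/
  run_nutch index index/ crawldb/ linkdb/ -dir merged/
}
```

Calling `crawl_and_merge` with `NUTCH` unset prints the planned command sequence. To compare against the unmerged case described in the report, the last three steps would presumably be pointed at `segments/` instead of `merged/`.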



--
This message was sent by Atlassian JIRA
(v6.2#6252)
