nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc
Date Tue, 18 Feb 2014 10:03:20 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13903916#comment-13903916 ]

Sebastian Nagel commented on NUTCH-1706:
----------------------------------------

Hi [~markus17], point 2 is definitely a problem: in a sample crawl (seed was {{http://nutch.apache.org/}}),
one of the 2 fetch_notmodified items is lost when indexing (data attached).
{code}
# 1. index only "old" segments
% bin/nutch index -Ddummy.path=index2013.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    -deleteGone

# 2. also include "new" segment containing refetches
% bin/nutch index -Ddummy.path=index2014.txt crawl/crawldb \
    crawl/segments/20131115203640/ \
    crawl/segments/20131115203847/ \
    crawl/segments/20140217140849/ \
    -deleteGone

# 3. since the "new" segment contains only "successful" refetches (of fetch_success or fetch_notmodified)
#    both indexes should contain exactly the same number of documents. But they do not!
% diff index2013.txt index2014.txt 
26d25
< add   http://tika.apache.org/
{code}
The second not-modified page ({{http://nutch.apache.org/}}) is indexed. Running the debugger
showed that the ordering of values in the reduce function differs between the two pages, even in
local mode. We should take this seriously and check whether we can guarantee that the newest
values are always preferred (similar to SegmentMerger).

Nevertheless, a fetch_notmodified datum should never overwrite any other fetch datum. The attached
patch includes this check again; apart from that, it is identical to [~markus17]'s patch.
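
To illustrate the rule, here is a minimal, self-contained sketch of an order-independent choice of the fetch datum. It is not the attached patch and does not use the Nutch classes: {{FetchDatumChooser}}, {{Status}}, {{Datum}} and {{choose}} are hypothetical names, and the assumption that each datum carries a comparable fetch time is made only for this example.

```java
import java.util.Arrays;

// Hypothetical sketch, NOT the NUTCH-1706 patch and NOT the Nutch API.
final class FetchDatumChooser {

    // Stand-in for the two fetch statuses discussed in the comment.
    enum Status { FETCH_SUCCESS, FETCH_NOTMODIFIED }

    // Stand-in for a fetch datum: a status plus a fetch time (assumed field).
    static final class Datum {
        final Status status;
        final long fetchTime;

        Datum(Status status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    // Picks the fetch datum regardless of the order in which values arrive:
    // a FETCH_NOTMODIFIED datum never replaces any other fetch datum, and
    // among data of the same kind the newest one wins.
    static Datum choose(Iterable<Datum> values) {
        Datum chosen = null;
        for (Datum d : values) {
            if (chosen == null) {
                chosen = d;
                continue;
            }
            boolean chosenNotModified = chosen.status == Status.FETCH_NOTMODIFIED;
            boolean candidateNotModified = d.status == Status.FETCH_NOTMODIFIED;
            if (candidateNotModified && !chosenNotModified) {
                continue; // never overwrite another fetch datum
            }
            if (!candidateNotModified && chosenNotModified) {
                chosen = d; // a real fetch replaces a not-modified placeholder
                continue;
            }
            if (d.fetchTime > chosen.fetchTime) {
                chosen = d; // same kind: prefer the newest
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        Datum success = new Datum(Status.FETCH_SUCCESS, 2013L);
        Datum notModified = new Datum(Status.FETCH_NOTMODIFIED, 2014L);
        // Both iteration orders should select the same datum.
        System.out.println(choose(Arrays.asList(success, notModified)).status);
        System.out.println(choose(Arrays.asList(notModified, success)).status);
    }
}
```

With a rule of this shape the result no longer depends on the order in which the reducer happens to see the values, which is the property the debugger run showed to be missing.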

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, nutch-1706-testdata.tgz
>
>
> The code path in IndexerMapReduce is wrong: the delete code should be located after all reducer
> values have been gathered.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
