nutch-dev mailing list archives

From "Markus Jelsma (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1706) IndexerMapReduce does not remove db_redir_temp etc
Date Tue, 18 Feb 2014 14:58:20 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13904113#comment-13904113 ]

Markus Jelsma commented on NUTCH-1706:
--------------------------------------

By the way, I got your latest patch and test data, but I don't see the problem:

{code}
markus@midas:~/projects/apache/nutch/branches/trunk/runtime/local$ bin/nutch index -Dplugin.includes="indexer-dummy|index-basic" -Ddummy.path=index2013.txt crawl/crawldb crawl/segments/20131115203640/ crawl/segments/20131115203847/ -deleteGone
Indexer: starting at 2014-02-18 15:57:01
Indexer: deleting gone documents: true
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
DummyIndexWriter
        dummy.path : Path of the file to write to (mandatory)


Indexer: finished at 2014-02-18 15:57:04, elapsed: 00:00:02
markus@midas:~/projects/apache/nutch/branches/trunk/runtime/local$ bin/nutch index -Dplugin.includes="indexer-dummy|index-basic" -Ddummy.path=index2014.txt crawl/crawldb crawl/segments/20131115203640/ crawl/segments/20131115203847/ crawl/segments/20140217140849/ -deleteGone
Indexer: starting at 2014-02-18 15:57:25
Indexer: deleting gone documents: true
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
DummyIndexWriter
        dummy.path : Path of the file to write to (mandatory)


Indexer: finished at 2014-02-18 15:57:28, elapsed: 00:00:02
markus@midas:~/projects/apache/nutch/branches/trunk/runtime/local$ diff index2013.txt index2014.txt

markus@midas:~/projects/apache/nutch/branches/trunk/runtime/local$ 
{code}

> IndexerMapReduce does not remove db_redir_temp etc
> --------------------------------------------------
>
>                 Key: NUTCH-1706
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1706
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Blocker
>             Fix For: 1.8
>
>         Attachments: NUTCH-1706-trunk-v2.patch, NUTCH-1706-trunk.patch, nutch-1706-testdata.tgz
>
>
> The code path is wrong in IndexerMapReduce; the delete code should be located after all reducer values have been gathered.
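
For illustration only, here is a minimal sketch of the "gather first, decide afterwards" pattern the issue description asks for. It is not the Nutch code and not the attached patch: the class DeleteAfterGatherSketch, its Status and Action enums, and the decide method are all hypothetical stand-ins, and the rule for the case where -deleteGone is not set is an assumption.

```java
import java.util.List;

public class DeleteAfterGatherSketch {
    // Hypothetical status labels for this sketch; they only mimic the
    // db_* / fetch_* names and are not the real Nutch constants.
    public enum Status { DB_FETCHED, DB_GONE, DB_REDIR_TEMP, DB_REDIR_PERM, FETCH_SUCCESS }

    public enum Action { INDEX, DELETE, SKIP }

    // Decide what to do with one URL after looking at every value grouped under it.
    public static Action decide(List<Status> values, boolean deleteGone) {
        Status dbStatus = null;   // status carried by the crawldb record, if any
        boolean fetched = false;  // a successful fetch was seen in some segment

        // Phase 1: gather only. No decision is taken while values are still
        // arriving, so the order of the values cannot change the result.
        for (Status s : values) {
            if (s == Status.FETCH_SUCCESS) {
                fetched = true;
            } else {
                dbStatus = s;
            }
        }

        // Phase 2: decide once everything for this URL is known.
        boolean removable = dbStatus == Status.DB_GONE
                || dbStatus == Status.DB_REDIR_TEMP
                || dbStatus == Status.DB_REDIR_PERM;
        if (removable) {
            // Assumption of this sketch: without deleteGone the record is skipped.
            return deleteGone ? Action.DELETE : Action.SKIP;
        }
        return fetched ? Action.INDEX : Action.SKIP;
    }

    public static void main(String[] args) {
        // The same two values in both orders lead to the same decision.
        List<Status> fetchFirst = List.of(Status.FETCH_SUCCESS, Status.DB_REDIR_TEMP);
        List<Status> dbFirst = List.of(Status.DB_REDIR_TEMP, Status.FETCH_SUCCESS);
        System.out.println(decide(fetchFirst, true));
        System.out.println(decide(dbFirst, true));
    }
}
```

The point of the two phases is that a decision taken inside the loop can depend on which value happens to arrive first, whereas a decision taken after the loop cannot.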



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
