nutch-dev mailing list archives

From "hussein Al_Ahmad (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (NUTCH-1690) IndexClean: mark url as unindexed after clean to not delete again
Date Sat, 19 Aug 2017 15:16:01 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16132993#comment-16132993 ]

hussein Al_Ahmad edited comment on NUTCH-1690 at 8/19/17 3:15 PM:
------------------------------------------------------------------

You should check whether status == CrawlStatus.STATUS_DUPLICATED in the IndexingJob and skip
the page if so. Otherwise, if you're using -all for batchId and the URL isn't generated in
that cycle, the duplicated page is going to be indexed in the next cycle.
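
A minimal sketch of that check, assuming nothing about the real Nutch classes: the Page type,
the status values and the method names below are hypothetical stand-ins invented for
illustration, not the Nutch API. It only shows the shape of the filter, where a page whose
status is "duplicated" is skipped no matter which batch is being indexed.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateSkipSketch {
    // Hypothetical status codes; the numeric values are made up for this sketch.
    static final int STATUS_FETCHED = 2;
    static final int STATUS_DUPLICATED = 99;

    // Hypothetical stand-in for a stored page row.
    static final class Page {
        final String url;
        final int status;

        Page(String url, int status) {
            this.url = url;
            this.status = status;
        }
    }

    // The check suggested in the comment: never index a page marked as duplicated.
    static boolean shouldIndex(Page page) {
        return page.status != STATUS_DUPLICATED;
    }

    // Returns the URLs that would be sent to the index, in input order.
    static List<String> selectForIndexing(List<Page> pages) {
        List<String> urls = new ArrayList<>();
        for (Page page : pages) {
            if (shouldIndex(page)) {
                urls.add(page.url);
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.com/a", STATUS_FETCHED));
        pages.add(new Page("http://example.com/a-copy", STATUS_DUPLICATED));
        System.out.println(selectForIndexing(pages));
    }
}
```

Checking the status itself, rather than relying on markers, is what keeps the filter
independent of whether the URL was generated in the current cycle.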


was (Author: opethema):
If you are using -all for batchId, you should also remove UPDATEDB_MARK (if it exists); otherwise
the duplicated URLs are going to be indexed again if they aren't generated in the next cycle.



> IndexClean: mark url as unindexed after clean to not delete again
> -----------------------------------------------------------------
>
>                 Key: NUTCH-1690
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1690
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Tien Nguyen Manh
>            Priority: Minor
>             Fix For: 2.5
>
>         Attachments: NUTCH-1690.patch
>
>
> We should mark a deleted page so that it is not deleted again and again. That can be done
> simply by removing the index marker when we delete.
> I also changed solrclean to delete duplicated URLs.
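
A minimal sketch of the idea in the description, assuming a simplified page model: the Page
class, the INDEX_MARK key and the clean method are hypothetical names used only for
illustration and are not taken from the attached patch or from the Nutch API. A page is
deleted from the index only while it still carries the index marker, and the marker is
removed as part of the delete, so a later clean run finds nothing left to do for it.

```java
import java.util.HashMap;
import java.util.Map;

class CleanMarkerSketch {
    // Hypothetical marker key meaning "this page is currently in the index".
    static final String INDEX_MARK = "index_mark";

    // Hypothetical stand-in for a stored page row.
    static final class Page {
        final Map<String, String> markers = new HashMap<>();
        boolean gone; // true when the page should no longer be in the index
    }

    // Returns true if a delete was issued for this page during this run.
    static boolean clean(Page page) {
        if (!page.gone || !page.markers.containsKey(INDEX_MARK)) {
            return false; // nothing to delete, or already deleted earlier
        }
        // A real job would send the delete request to the index here.
        page.markers.remove(INDEX_MARK); // mark as unindexed so it is not deleted again
        return true;
    }

    public static void main(String[] args) {
        Page page = new Page();
        page.gone = true;
        page.markers.put(INDEX_MARK, "batch-1");
        System.out.println("first run deleted: " + clean(page));
        System.out.println("second run deleted: " + clean(page));
    }
}
```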



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
