nutch-dev mailing list archives

From "Julien Nioche (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1416) IndexerMapReduce can index older version of a document instead of latest one
Date Sun, 20 Apr 2014 15:56:14 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1416:
---------------------------------

    Summary: IndexerMapReduce can index older version of a document instead of latest one
 (was: Can not update the index)

> IndexerMapReduce can index older version of a document instead of latest one
> ----------------------------------------------------------------------------
>
>                 Key: NUTCH-1416
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1416
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>            Reporter: Jianyun He
>            Priority: Critical
>
> When we update the index, there is no guarantee that the content being indexed is the
> latest version. In the class IndexerMapReduce, the reduce() method contains the following code:
> public void reduce(Text key, Iterator<NutchWritable> values,
>                      OutputCollector<Text, NutchDocument> output, Reporter reporter)
>                      throws IOException {
>    ……
>    } else if (value instanceof ParseData) {  
>       parseData = (ParseData)value;
>    } else if (value instanceof ParseText) { 
>       parseText = (ParseText)value;
>    }
>    ……
> }
> For example, I fetched web page A 30 days ago, and now I fetch it again. The key A
> then corresponds to two ParseData objects (located in different segments). But this code
> does not compare the fetch times and simply overwrites the previous value, so the final
> value may be the old one.
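
The selection step the description finds missing can be sketched in isolation, without any Nutch classes. This is only an illustration under stated assumptions: SegmentValue, its fields, and pickLatest are hypothetical names, and the sketch assumes that each value arriving at the reducer carries a comparable fetch time, which the quoted snippet does not show. It is not the actual fix applied to IndexerMapReduce.

```java
import java.util.Arrays;
import java.util.List;

public class LatestVersionSketch {

    /** Hypothetical stand-in for a per-segment value such as ParseData. */
    static final class SegmentValue {
        final String segment;  // hypothetical: name of the segment the value came from
        final long fetchTime;  // hypothetical: fetch time in milliseconds
        final String content;  // hypothetical: payload to be indexed

        SegmentValue(String segment, long fetchTime, String content) {
            this.segment = segment;
            this.fetchTime = fetchTime;
            this.content = content;
        }
    }

    /**
     * Returns the value with the greatest fetch time, or null if there are none.
     * Unlike a plain overwrite, the result does not depend on iteration order.
     */
    static SegmentValue pickLatest(Iterable<SegmentValue> values) {
        SegmentValue latest = null;
        for (SegmentValue v : values) {
            if (latest == null || v.fetchTime > latest.fetchTime) {
                latest = v;
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        // The newer value arrives first; a plain overwrite would keep the older one.
        List<SegmentValue> values = Arrays.asList(
                new SegmentValue("segment-new", 2_592_000_000L, "new content"),
                new SegmentValue("segment-old", 0L, "old content"));
        System.out.println(pickLatest(values).content); // prints "new content"
    }
}
```

The two fetch times in main differ by 30 days expressed in milliseconds, mirroring the example in the description. Comparing on an explicit timestamp is one possible design; any other ordering key available on the values would need the same comparison in place of the overwrite.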



--
This message was sent by Atlassian JIRA
(v6.2#6252)
