nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2184) Enable IndexingJob to function with no crawldb
Date Wed, 08 Nov 2017 21:04:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244723#comment-16244723 ]

ASF GitHub Bot commented on NUTCH-2184:
---------------------------------------

sebastian-nagel commented on a change in pull request #95: NUTCH-2184 Enable IndexingJob to function with no crawldb
URL: https://github.com/apache/nutch/pull/95#discussion_r149795839
 
 

 ##########
 File path: src/java/org/apache/nutch/indexer/IndexerMapReduce.java
 ##########
 @@ -52,14 +52,34 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.scoring.ScoringFilterException;
 import org.apache.nutch.scoring.ScoringFilters;
-
-public class IndexerMapReduce extends Configured implements
-    Mapper<Text, Writable, Text, NutchWritable>,
-    Reducer<Text, NutchWritable, Text, NutchIndexAction> {
+import org.apache.nutch.util.NutchConfiguration;
+
+/**
+ * <p>This class is typically invoked from within 
+ * {@link org.apache.nutch.indexer.IndexingJob}
+ * and handles all MapReduce functionality required
+ * when undertaking indexing.</p>
+ * <p>This is a consequence of one or more indexing plugins 
+ * being invoked which extend 
+ * {@link org.apache.nutch.indexer.IndexWriter}.</p>
+ * <p>See 
+ * {@link org.apache.nutch.indexer.IndexerMapReduce#initMRJob(Path, Path, Collection, JobConf, boolean)}
+ * for details on the specific data structures and parameters required for indexing.</p>
+ *
+ */
+public class IndexerMapReduce {
 
   public static final Logger LOG = LoggerFactory
       .getLogger(IndexerMapReduce.class);
 
+  // using normalizers and/or filters
+  private static boolean normalize = false;
+  private static boolean filter = false;
+
+  // url normalizers, filters and job configuration
+  private static URLNormalizers urlNormalizers;
+  private static URLFilters urlFilters;
 
 Review comment:
   This also does not work in distributed mode: the mapper and reducer are executed in different tasks/JVMs, so static fields set in one JVM are not visible in another. See [NUTCH-2375/#221](https://github.com/apache/nutch/pull/221#pullrequestreview-62780003) for the same problem.
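
   As a rough illustration of the usual alternative, the sketch below has each task read its settings from the job configuration into instance fields instead of relying on static fields. It is only a sketch under stated assumptions: `java.util.Properties` stands in for the Hadoop job configuration so the example runs on its own, and the class name, the `configure` method shown here, and both property keys are hypothetical, not taken from Nutch or from this pull request.

```java
import java.util.Properties;

/**
 * Sketch only: java.util.Properties stands in for a Hadoop job configuration,
 * and the property keys below are hypothetical, not Nutch's actual keys.
 */
public class ConfigPerTaskSketch {

  // Hypothetical keys, chosen for this illustration.
  static final String NORMALIZE_KEY = "sketch.url.normalize";
  static final String FILTER_KEY = "sketch.url.filter";

  /** Stand-in for a mapper or reducer task running in its own JVM. */
  static class SketchTask {
    // Instance fields, populated per task from the configuration,
    // instead of static fields set once by the job-submitting client.
    private boolean normalize;
    private boolean filter;

    void configure(Properties conf) {
      normalize = Boolean.parseBoolean(conf.getProperty(NORMALIZE_KEY, "false"));
      filter = Boolean.parseBoolean(conf.getProperty(FILTER_KEY, "false"));
    }

    boolean isNormalize() { return normalize; }

    boolean isFilter() { return filter; }
  }

  public static void main(String[] args) {
    // The client writes its settings into the configuration ...
    Properties conf = new Properties();
    conf.setProperty(NORMALIZE_KEY, "true");

    // ... and every task reads them back in its own configure() call.
    SketchTask mapTask = new SketchTask();
    SketchTask reduceTask = new SketchTask();
    mapTask.configure(conf);
    reduceTask.configure(conf);

    System.out.println("map normalize=" + mapTask.isNormalize()
        + ", reduce normalize=" + reduceTask.isNormalize()
        + ", reduce filter=" + reduceTask.isFilter());
  }
}
```

   The point of the design is that the configuration is the only state that travels with the job, so anything a task needs has to be derivable from it inside that task.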

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Enable IndexingJob to function with no crawldb
> ----------------------------------------------
>
>                 Key: NUTCH-2184
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2184
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.14
>
>         Attachments: NUTCH-2184.patch, NUTCH-2184v2.patch
>
>
> Sometimes when working with distributed team(s), we have found that we can 'lose' data structures which are currently considered critical, e.g. crawldb, linkdb and/or segments.
> In my current scenario I have a requirement to index segment data with no accompanying crawldb or linkdb.
> Absence of the latter is OK as linkdb is optional; however, currently in [IndexerMapReduce|https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/indexer/IndexerMapReduce.java] crawldb is mandatory.
> This ticket should enhance the IndexerMapReduce code to support the use case where you ONLY have segments and want to force an index for every record present.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
