nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2616) Review routing of deletions by Exchange component
Date Thu, 19 Jul 2018 12:52:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16549225#comment-16549225 ]

ASF GitHub Bot commented on NUTCH-2616:
---------------------------------------

sebastian-nagel closed pull request #363: NUTCH-2616 Review routing of deletions by Exchange component
URL: https://github.com/apache/nutch/pull/363

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/indexer/IndexWriters.java b/src/java/org/apache/nutch/indexer/IndexWriters.java
index db37d62f1..3ac20bfea 100644
--- a/src/java/org/apache/nutch/indexer/IndexWriters.java
+++ b/src/java/org/apache/nutch/indexer/IndexWriters.java
@@ -18,7 +18,6 @@
 
 import org.apache.hadoop.conf.Configuration;
 import org.apache.nutch.exchange.Exchanges;
-import org.apache.nutch.metadata.Nutch;
 import org.apache.nutch.plugin.Extension;
 import org.apache.nutch.plugin.ExtensionPoint;
 import org.apache.nutch.plugin.PluginRepository;
@@ -233,14 +232,10 @@ public void update(NutchDocument doc) throws IOException {
     }
   }
 
-  public void delete(String key, NutchDocument doc) throws IOException {
-    for (String indexWriterId : getIndexWriters(doc)) {
-      this.indexWriters.get(indexWriterId).getIndexWriter().delete(key);
-    }
-  }
-
   public void delete(String key) throws IOException {
-
+    for (IndexWriterWrapper iww : indexWriters.values()) {
+      iww.getIndexWriter().delete(key);
+    }
   }
 
   public void close() throws IOException {
diff --git a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
index 1837e1fcd..3ce4f8061 100644
--- a/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
+++ b/src/java/org/apache/nutch/indexer/IndexerOutputFormat.java
@@ -54,7 +54,7 @@ public void write(Text key, NutchIndexAction indexAction)
         if (indexAction.action == NutchIndexAction.ADD) {
           writers.write(indexAction.doc);
         } else if (indexAction.action == NutchIndexAction.DELETE) {
-          writers.delete(key.toString(), indexAction.doc);
+          writers.delete(key.toString());
         }
       }
     };


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Review routing of deletions by Exchange component
> -------------------------------------------------
>
>                 Key: NUTCH-2616
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2616
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> If the exchange component (NUTCH-2412) is enabled, it must also route deletions (404, etc.) to the configured index writers. Deletions are done using only the document ID (URL): there is no NutchDocument (or it is null), which needs to be handled to avoid an NPE in the Exchanges class or the exchange plugins.
> NUTCH-2412 has added a new delete method in the IndexWriters class:
> - {{delete(String, NutchDocument)}} is now called from the indexing job ({{bin/nutch index ... -deleteGone}}). However, the NutchDocument is always null in case of deletions, see IndexerMapReduce.DELETE_ACTION.
> - {{delete(String)}} is now a no-op but is still called from CleaningJob ({{bin/nutch clean ...}}).
> We could ([~roannel], are there better options?)
> - send deletions to all index writers. This causes a certain overhead (could be critical if deletion lists are long).
> - pass a document containing only a single field (the document ID / URL) to the exchange component.
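
The first option listed above (sending each deletion to every index writer) matches what the diff in this message does in IndexWriters.delete(String). The following is a minimal, self-contained Java sketch of that broadcast, not the Nutch implementation: SimpleWriter, RecordingWriter, and deleteFromAll are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DeleteRoutingSketch {

  /** Hypothetical stand-in for an index writer; only deletion is modelled. */
  interface SimpleWriter {
    void delete(String key);
  }

  /** Hypothetical writer that records the keys it was asked to delete. */
  static class RecordingWriter implements SimpleWriter {
    final List<String> deleted = new ArrayList<>();

    @Override
    public void delete(String key) {
      deleted.add(key);
    }
  }

  /**
   * Broadcasts a deletion to every configured writer. No document is
   * needed, so no routing decision (and no null check) is involved.
   */
  static void deleteFromAll(Map<String, ? extends SimpleWriter> writers,
      String key) {
    for (SimpleWriter writer : writers.values()) {
      writer.delete(key);
    }
  }

  public static void main(String[] args) {
    // Writer IDs are made up for the example.
    Map<String, RecordingWriter> writers = new LinkedHashMap<>();
    writers.put("writer_a", new RecordingWriter());
    writers.put("writer_b", new RecordingWriter());

    deleteFromAll(writers, "http://example.org/gone");

    for (Map.Entry<String, RecordingWriter> e : writers.entrySet()) {
      System.out.println(e.getKey() + " deleted " + e.getValue().deleted);
    }
  }
}
```

The trade-off is the one named in the issue: every writer receives every deletion, including writers that never indexed the document, which may become costly when deletion lists are long.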



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
