nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working
Date Wed, 03 Jan 2018 17:42:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16309965#comment-16309965 ]

ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------

lewismc closed pull request #269: fix for NUTCH-2490 Fix sitemap index file processing
URL: https://github.com/apache/nutch/pull/269

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/java/org/apache/nutch/util/SitemapProcessor.java b/src/java/org/apache/nutch/util/SitemapProcessor.java
index 5150d61c3..c1c0c9a81 100644
--- a/src/java/org/apache/nutch/util/SitemapProcessor.java
+++ b/src/java/org/apache/nutch/util/SitemapProcessor.java
@@ -213,6 +213,7 @@ private void generateSitemapUrlDatum(Protocol protocol, String url, Context cont
       AbstractSiteMap asm = parser.parseSiteMap(content.getContentType(), content.getContent(), new URL(url));
 
       if(asm instanceof SiteMap) {
+        LOG.info("Parsing sitemap file: {}", asm.getUrl().toString());
         SiteMap sm = (SiteMap) asm;
         Collection<SiteMapURL> sitemapUrls = sm.getSiteMapUrls();
         for(SiteMapURL sitemapUrl: sitemapUrls) {
@@ -252,10 +253,13 @@ else if (asm instanceof SiteMapIndex) {
         SiteMapIndex index = (SiteMapIndex) asm;
         Collection<AbstractSiteMap> sitemapUrls = index.getSitemaps();
 
+        if (sitemapUrls.isEmpty()) {
+          return;
+        }
+
+        LOG.info("Parsing sitemap index file: {}", index.getUrl().toString());
         for(AbstractSiteMap sitemap: sitemapUrls) {
-          if(sitemap.isIndex()) {
-            generateSitemapUrlDatum(protocol, sitemap.getUrl().toString(), context);
-          }
+          generateSitemapUrlDatum(protocol, sitemap.getUrl().toString(), context);
         }
       }
     }


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
>                 Key: NUTCH-2490
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2490
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Moreno Feltscher
>            Assignee: Moreno Feltscher
>             Fix For: 1.15
>
>
> The [sitemap processing feature|https://wiki.apache.org/nutch/SitemapFeature] does not properly handle sitemap index files due to an unnecessary conditional.
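
For illustration, the effect of the removed conditional can be reproduced with a small in-memory model. This is only a sketch under stated assumptions: the `Node` class and the `walkOld`/`walkFixed` methods are hypothetical stand-ins, not Nutch or crawler-commons APIs, and no fetching or parsing is performed. It shows why an index whose children are plain sitemaps yielded no URLs when only index children were followed.

```java
import java.util.ArrayList;
import java.util.List;

public class SitemapIndexWalkSketch {

    /** Hypothetical stand-in for a parsed sitemap or sitemap index (not a real Nutch type). */
    public static final class Node {
        public final String url;
        public final boolean index;
        public final List<Node> children = new ArrayList<>(); // child sitemaps, used by indexes
        public final List<String> pages = new ArrayList<>();  // page URLs, used by plain sitemaps

        public Node(String url, boolean index) {
            this.url = url;
            this.index = index;
        }
    }

    /** Models the behaviour before the patch: only children that are themselves indexes are followed. */
    public static void walkOld(Node node, List<String> out) {
        if (!node.index) {
            out.addAll(node.pages);
            return;
        }
        for (Node child : node.children) {
            if (child.index) {
                walkOld(child, out);
            }
        }
    }

    /** Models the behaviour after the patch: every child listed in an index is followed. */
    public static void walkFixed(Node node, List<String> out) {
        if (!node.index) {
            out.addAll(node.pages);
            return;
        }
        if (node.children.isEmpty()) {
            return; // mirrors the early return added for empty indexes
        }
        for (Node child : node.children) {
            walkFixed(child, out);
        }
    }

    public static void main(String[] args) {
        // An index that lists one plain sitemap with two pages (example URLs are made up).
        Node root = new Node("https://example.com/sitemap_index.xml", true);
        Node plain = new Node("https://example.com/sitemap-a.xml", false);
        plain.pages.add("https://example.com/page1");
        plain.pages.add("https://example.com/page2");
        root.children.add(plain);

        List<String> oldOut = new ArrayList<>();
        walkOld(root, oldOut);
        List<String> fixedOut = new ArrayList<>();
        walkFixed(root, fixedOut);

        System.out.println("old:   " + oldOut.size());   // prints "old:   0"
        System.out.println("fixed: " + fixedOut.size()); // prints "fixed: 2"
    }
}
```

In the real code the recursion goes through generateSitemapUrlDatum, which fetches and parses each child URL; the model above replaces that step with pre-built objects so the control flow can be seen in isolation.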



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
