nutch-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-2490) Sitemap processing: Sitemap index files not working
Date Tue, 02 Jan 2018 22:55:00 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16308851#comment-16308851 ]

ASF GitHub Bot commented on NUTCH-2490:
---------------------------------------

mfeltscher opened a new pull request #269: fix for NUTCH-2490 Fix sitemap index file processing
URL: https://github.com/apache/nutch/pull/269
 
 
   This fixes processing of sitemap index files by removing an unnecessary conditional.
   
   Before:
   ```bash
   $ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
   SitemapProcessor: sitemap urls dir: sitemaps.txt
   SitemapProcessor: Starting at 2018-01-02 22:44:58
   robots.txt whitelist not configured.
   SitemapProcessor: Total records rejected by filters: 0
   SitemapProcessor: Total sitemaps from HostDb: 0
   SitemapProcessor: Total sitemaps from seed urls: 1
   SitemapProcessor: Total failed sitemap fetches: 0
   SitemapProcessor: Total new sitemap entries added: 0
   SitemapProcessor: Finished at 2018-01-02 22:45:02, elapsed: 00:00:03
   ```
   
   After:
   ```bash
   $ echo "https://filialen.migros.ch/sitemap.xml" > sitemaps.txt && bin/nutch sitemap crawldata -sitemapUrls sitemaps.txt
   SitemapProcessor: sitemap urls dir: sitemaps.txt
   SitemapProcessor: Starting at 2018-01-02 22:47:44
   robots.txt whitelist not configured.
   Parsing sitemap index file: https://filialen.migros.ch/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/de/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/fr/sitemap.xml
   Parsing sitemap file: https://filialen.migros.ch/it/sitemap.xml
   SitemapProcessor: Total records rejected by filters: 0
   SitemapProcessor: Total sitemaps from HostDb: 0
   SitemapProcessor: Total sitemaps from seed urls: 1
   SitemapProcessor: Total failed sitemap fetches: 0
   SitemapProcessor: Total new sitemap entries added: 5754
   SitemapProcessor: Finished at 2018-01-02 22:47:58, elapsed: 00:00:13
   ```
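
To check independently whether a seed URL points at a sitemap index rather than a plain sitemap, the root element of the file can be inspected: under the sitemaps.org protocol an index uses `<sitemapindex>` and a regular sitemap uses `<urlset>`. The following is only a rough sketch, not part of Nutch or of this pull request. The function names `sitemap_kind` and `sitemap_locs` are hypothetical, and the sketch assumes the sitemap has already been saved to a local, uncompressed file with each `<loc>` element on a single line.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of Nutch): inspect a local sitemap XML file.

# Print "index" for a sitemap index, "urlset" for a regular sitemap,
# and "unknown" otherwise, based on the root element name.
sitemap_kind() {
  if grep -q '<sitemapindex' "$1"; then
    echo index
  elif grep -q '<urlset' "$1"; then
    echo urlset
  else
    echo unknown
  fi
}

# Print the content of every <loc> element, one per line.
# For an index file these are the child sitemap URLs.
sitemap_locs() {
  grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's#<loc>##' -e 's#</loc>##'
}
```

For a saved copy of an index file, `sitemap_kind` would be expected to print `index`, and `sitemap_locs` would list the child sitemaps that the "After" run above reports as parsed.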

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Sitemap processing: Sitemap index files not working
> ---------------------------------------------------
>
>                 Key: NUTCH-2490
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2490
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Moreno Feltscher
>            Assignee: Moreno Feltscher
>
> The [sitemap processing feature](https://wiki.apache.org/nutch/SitemapFeature) does not properly handle sitemap index files due to an unnecessary conditional.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
