nutch-dev mailing list archives

From "Yossi Tamari (JIRA)" <>
Subject [jira] [Created] (NUTCH-2511) SitemapProcessor limited by http.content.limit
Date Mon, 19 Feb 2018 16:36:00 GMT
Yossi Tamari created NUTCH-2511:

             Summary: SitemapProcessor limited by http.content.limit
                 Key: NUTCH-2511
             Project: Nutch
          Issue Type: Bug
          Components: sitemap
    Affects Versions: 1.14
            Reporter: Yossi Tamari

Because SitemapProcessor fetches sitemaps through the HTTP protocol plugin, which limits the
size of a response to http.content.limit (64KB by default), it can only handle sitemaps
smaller than that limit.

I don't believe that is what users intend when they set http.content.limit: they want
to limit document size, not sitemap size. The sitemap spec explicitly allows sitemaps of
up to 50MB.
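
To illustrate the mismatch between the two limits, here is a minimal, self-contained Java sketch. It is not Nutch's actual code: the class ContentLimitSketch and the helper readWithLimit are hypothetical names, and the helper only assumes the documented behaviour of http.content.limit (content longer than a nonnegative limit is cut off, a negative value means no limit). The 200KB byte array stands in for a sitemap.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ContentLimitSketch {
    // Default value of http.content.limit cited in this report (64KB), in bytes.
    static final int DEFAULT_CONTENT_LIMIT = 64 * 1024;
    // Maximum sitemap size cited from the sitemap spec (50MB), in bytes.
    static final int SITEMAP_MAX_BYTES = 50 * 1024 * 1024;

    // Hypothetical helper (not a Nutch API): reads at most `limit` bytes
    // from the stream. A negative limit means "read everything".
    static byte[] readWithLimit(InputStream in, int limit) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int remaining = limit;
        while (limit < 0 || remaining > 0) {
            int want = limit < 0 ? buf.length : Math.min(buf.length, remaining);
            int n = in.read(buf, 0, want);
            if (n == -1) {
                break; // end of stream
            }
            out.write(buf, 0, n);
            if (limit >= 0) {
                remaining -= n;
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] sitemap = new byte[200 * 1024]; // stand-in for a 200KB sitemap

        // With the document-oriented limit, the sitemap is cut off at 64KB.
        byte[] truncated = readWithLimit(
            new ByteArrayInputStream(sitemap), DEFAULT_CONTENT_LIMIT);
        System.out.println("content limit: " + truncated.length + " bytes");

        // With a limit sized for sitemaps, the whole body is read.
        byte[] complete = readWithLimit(
            new ByteArrayInputStream(sitemap), SITEMAP_MAX_BYTES);
        System.out.println("sitemap limit: " + complete.length + " bytes");
    }
}
```

A sitemap cut off at 64KB is most likely no longer well-formed XML, so the practical effect is that larger sitemaps cannot be processed at all. One possible direction, offered only as an assumption here, would be a separate size limit applied to sitemap fetches.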
