nutch-dev mailing list archives

From "Lewis John McGibbney (JIRA)" <>
Subject [jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x
Date Tue, 25 Aug 2015 06:22:46 GMT


Lewis John McGibbney updated NUTCH-1741:
    Attachment: NUTCH-1741v5.patch

Patch for 2.X HEAD which adds missing license headers, applies cleanly with no fuzziness and
builds and tests successfully.

[~cguzel] great work on this patch. We have a few issues.
 1) as you've mentioned on the mailing list, we have some issues with the MemStore in Gora
which we need to fix. We need to be running the tests in order to put the code you've
implemented to use and also to build confidence in the sitemap parser logic.
 2) What about adding your implementation to the src/bin scripts? Are you happy with this
not being part of the logic contained within there? Maybe at a later stage we can think about
 3) I notice no Javadoc for the new classes you've implemented... can you add Javadoc to detail
what the Sitemap data structure (Map<CharSequence, CharSequence>) looks like, how the
logic works, etc? This would make it much clearer to others trying to read the code.
 4) I like the way that you've consistently modularized code into methods throughout your
new work. This is really nice.
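
To illustrate the kind of Javadoc asked for in 3), here is a minimal sketch. It is not taken
from the patch: the class name, the method name and the key names are all hypothetical. The
keys simply mirror the optional per-URL fields of the sitemaps.org protocol (lastmod,
changefreq, priority), on the assumption that the map holds something along those lines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical illustration only; this is NOT a class from the NUTCH-1741 patch.
 * It shows one way the sitemap map could be documented.
 */
public class SitemapEntryExample {

  /**
   * Builds the metadata map for a single sitemap URL entry.
   *
   * <p>Assumed layout (key names are an assumption, borrowed from the
   * sitemaps.org protocol):
   * <ul>
   *   <li>{@code lastmod}: last modification date of the page</li>
   *   <li>{@code changefreq}: how often the page is expected to change</li>
   *   <li>{@code priority}: relative priority, between 0.0 and 1.0</li>
   * </ul>
   * Fields that are absent from the sitemap are left out of the map.
   *
   * @param lastmod    value of the lastmod field, or null if absent
   * @param changefreq value of the changefreq field, or null if absent
   * @param priority   value of the priority field, or null if absent
   * @return a map from field name to field value, in insertion order
   */
  public static Map<CharSequence, CharSequence> describeEntry(
      String lastmod, String changefreq, String priority) {
    Map<CharSequence, CharSequence> entry = new LinkedHashMap<>();
    if (lastmod != null) {
      entry.put("lastmod", lastmod);
    }
    if (changefreq != null) {
      entry.put("changefreq", changefreq);
    }
    if (priority != null) {
      entry.put("priority", priority);
    }
    return entry;
  }

  public static void main(String[] args) {
    // Example entry without a priority field.
    Map<CharSequence, CharSequence> entry = describeEntry("2015-08-25", "daily", null);
    System.out.println(entry);
  }
}
```

Whatever the real keys turn out to be, spelling them out in the Javadoc like this would let
readers follow the parser logic without having to trace every put() call.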

If we can address the above then we will be in a good position to think about further validation
through testing and then merging into 2.X.

> Support of Sitemaps in Nutch 2.x
> --------------------------------
>                 Key: NUTCH-1741
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, generator
>            Reporter: Alparslan Avcı
>              Labels: gsoc2015
>             Fix For: 2.4
>         Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, NUTCH-1741-v4.patch, NUTCH-1741.patch,
NUTCH-1741v5.patch, SitemapCrawlerLifeCycle.pdf, SitemapDevelopmentFor2x.pdf
> Sitemap support has to be implemented for 2.x branch. It is being discussed in NUTCH-1465
for trunk. 

This message was sent by Atlassian JIRA
