nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <>
Subject [jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch
Date Mon, 27 Jan 2014 17:09:48 GMT


Sebastian Nagel commented on NUTCH-1465:

Great, looks good and is really compact while providing a lot of functionality. I've just started
to test SitemapProcessor; here are my first comments:
* has no Apache license header
* would be nice to see counters in log output
* regarding Lewis' point #3: doesn't a comment saying "a hacky way" mean "try to avoid that"? Why
not set isHost inside map(...) with {{isHost = (value instanceof HostDatum)}} and pass it as
a parameter to filterNormalize()? This would avoid errors caused by incomplete heuristics,
such as this one, seen when testing with sitemaps accessed via the file protocol:
INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for http://file:/tmp/sitemap1.xml/: file
* concurrency: "returning" the value of isHost from filterNormalize() to map() via a member
variable is not thread-safe and will cause problems in combination with MultithreadedMapper.
One more reason to pass it from map() to filterNormalize() as a parameter.
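
A minimal sketch of the parameter-passing approach suggested in the two points above. It is not the code from the patch: the class name SitemapMapSketch, the nested HostDatum and CrawlDatum stubs, and the method signatures are stand-ins invented for illustration, and the body of filterNormalize() only assumes that a host key gets expanded to a URL while any other key is left alone. The point is that isHost is a local variable derived from the value's type, so no heuristic on the key is needed and no shared state is written.

```java
/** Sketch only: stand-in types, not the real Nutch or Hadoop classes. */
class SitemapMapSketch {

    /** Hypothetical stand-in for a host-level record. */
    static class HostDatum {}

    /** Hypothetical stand-in for a URL-level record. */
    static class CrawlDatum {}

    /**
     * Decide isHost from the value's type and pass it down as a parameter.
     * No member variable is written, so concurrent calls cannot interfere.
     */
    static String map(String key, Object value) {
        boolean isHost = (value instanceof HostDatum);
        return filterNormalize(key, isHost);
    }

    /**
     * Assumed behaviour: a host key is expanded to a URL before filtering,
     * anything else is treated as a URL already. Real URL filtering and
     * normalization are omitted here.
     */
    static String filterNormalize(String key, boolean isHost) {
        if (isHost) {
            return "http://" + key + "/";
        }
        return key;
    }
}
```

With this shape, a sitemap key such as file:/tmp/sitemap1.xml paired with a non-host value passes through unchanged instead of being rewritten with an http:// prefix.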

> Support sitemaps in Nutch
> -------------------------
>                 Key: NUTCH-1465
>                 URL:
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
> NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
> appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0]
> [1]

This message was sent by Atlassian JIRA
