nutch-dev mailing list archives

From "Tejas Patil (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (NUTCH-1465) Support sitemaps in Nutch
Date Tue, 28 Jan 2014 16:49:38 GMT

     [ https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1465:
-------------------------------

    Attachment: NUTCH-1465-trunk.v5.patch

Adding new patch 'v5' with the following changes:
1. Added the Apache license header, as per the review comment by [~wastl-nagel].
2. Added counters to the log output, as per the review comment by [~wastl-nagel].
3. Implemented the change suggested by [~wastl-nagel] for 'isHost' and 'filterNormalize'. I could do more refactoring to make it cleaner.
4. Added a new parameter "-noStrict" to control the checking done by the sitemap parser.

> Support sitemaps in Nutch
> -------------------------
>
>                 Key: NUTCH-1465
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1465
>             Project: Nutch
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Lewis John McGibbney
>            Assignee: Tejas Patil
>             Fix For: 1.8
>
>         Attachments: NUTCH-1465-sitemapinjector-trunk-v1.patch, NUTCH-1465-trunk.v1.patch,
>                      NUTCH-1465-trunk.v2.patch, NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch,
>                      NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 licensed and
> appears to have been used successfully to parse sitemaps as per the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
