nutch-dev mailing list archives

From "Sebastian Nagel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1321) IDNNormalizer
Date Thu, 19 Dec 2013 17:38:07 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853056#comment-13853056 ]

Sebastian Nagel commented on NUTCH-1321:
----------------------------------------

Hi [~ilhamikalkan],
great! Thanks! The patch looks good (not tested yet). A few comments:
# method isPunycode(url)
{code}
String[] arr = url.split("\\.");
if (arr[1].startsWith("xn--"))
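{code} fails for URLs like {{http://www.medizin.xn--uni-tbingen-xhb.de/}}, where the Punycode label is not the second dot-separated component ({{arr[1]}} is {{medizin}} here).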
# maybe we should make the decoding from Punycode to Unicode in the indexer scope configurable
via a property such as "urlnormalizer.idn.indexer.decode". URLs are used both as ordinary
content (tokenized field "url") and as the unique ID (field "id") for updating and deleting indexed
documents. Some indexer back-ends may require the id field to be pure ASCII or Punycode.
# cosmetics: code should be formatted with [eclipse-codeformat.xml|http://svn.apache.org/viewvc/nutch/branches/2.x/eclipse-codeformat.xml],
and patches generated as described in [1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer],
[2|http://wiki.apache.org/nutch/HowToContribute].
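
To illustrate the first two points, here is a minimal sketch, not the code from the patch. It assumes the host can be parsed with {{java.net.URI}}, checks every host label for the {{xn--}} prefix instead of only {{arr[1]}}, and makes decoding optional. The class and method names ({{PunycodeHostCheck}}, {{hostOf}}, {{hostForIndex}}) are hypothetical, and the boolean {{decode}} parameter only stands in for a property like the suggested "urlnormalizer.idn.indexer.decode", which would presumably be read from the Nutch configuration.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/** Illustrative helper; class and method names are hypothetical, not from the patch. */
final class PunycodeHostCheck {

  /** ACE prefix that marks a Punycode-encoded host label. */
  private static final String ACE_PREFIX = "xn--";

  /** Extracts the host part of a URL, or returns null if it cannot be parsed. */
  static String hostOf(String url) {
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  /** True if any label of the host (not just the second one) is Punycode. */
  static boolean isPunycode(String url) {
    String host = hostOf(url);
    if (host == null) {
      return false;
    }
    for (String label : host.toLowerCase(Locale.ROOT).split("\\.")) {
      if (label.startsWith(ACE_PREFIX)) {
        return true;
      }
    }
    return false;
  }

  /**
   * Returns the host as it could be written to the index. The decode flag
   * stands in for a configuration property (assumption, see lead-in).
   */
  static String hostForIndex(String url, boolean decode) {
    String host = hostOf(url);
    if (host == null || !decode) {
      return host; // keep the ASCII/Punycode form
    }
    return IDN.toUnicode(host); // decode Punycode labels to Unicode
  }
}
```

With {{decode}} set to false the host stays in its ASCII form, which would suit back-ends that need a pure-ASCII id field.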

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: Nutch-1321.patch, idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an indexer
> so it will encode ASCII URLs to their proper Unicode equivalent.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
