nutch-dev mailing list archives

From İlhami KALKAN (JIRA) <j...@apache.org>
Subject [jira] [Commented] (NUTCH-1321) IDNNormalizer
Date Fri, 20 Dec 2013 17:40:29 GMT

    [ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854267#comment-13854267
] 

İlhami KALKAN commented on NUTCH-1321:
--------------------------------------

Hi Sebastian,
1) This code block belongs to the old patch version, Nutch-1321.patch. Sorry for not
removing it. The new version of isPunycode(url) is in idnNormalizer.patch.
2) This patch converts only the 'url' field from Punycode back to Unicode while indexing.
The 'id' field is not converted; it keeps the Punycode value while indexing.
Is this enough for updating and deleting indexed documents? If we do need the punycoded
URL, can you explain a little more why we need it?
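
To make point 2 concrete, here is a minimal sketch of the idea, not the code from
idnNormalizer.patch. The class and method names (IdnSketch, isPunycodeHost, toUnicodeHost)
are hypothetical and it works on host names only. It assumes the standard java.net.IDN
class is acceptable for the conversion.

```java
import java.net.IDN;
import java.util.Locale;

// Illustrative only: names are hypothetical, not taken from the Nutch patch.
public class IdnSketch {

    // A host counts as Punycode-encoded if any label carries the "xn--" ACE prefix.
    static boolean isPunycodeHost(String host) {
        for (String label : host.toLowerCase(Locale.ROOT).split("\\.")) {
            if (label.startsWith("xn--")) {
                return true;
            }
        }
        return false;
    }

    // Decode only hosts that are Punycode-encoded; return all others unchanged.
    static String toUnicodeHost(String host) {
        return isPunycodeHost(host) ? IDN.toUnicode(host) : host;
    }

    public static void main(String[] args) {
        String idHost = "xn--mnchen-3ya.de";      // kept as-is for the 'id' field
        String urlHost = toUnicodeHost(idHost);   // decoded form for the 'url' field
        System.out.println("id:  " + idHost);
        System.out.println("url: " + urlHost);
    }
}
```

In this sketch the 'id' value stays in its ASCII form, so lookups for update and delete
still match the stored key, while only the 'url' value shown to users is decoded.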

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an indexer
> so it will encode ASCII URLs to their proper Unicode equivalent.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
