nutch-dev mailing list archives

From "Nick Lothian (JIRA)" <>
Subject [jira] Commented: (NUTCH-25) needs 'character encoding' detector
Date Wed, 30 Mar 2005 07:00:24 GMT
Nick Lothian commented on NUTCH-25:

ROME has an XmlReader which encapsulates most of the detection
code required.

ROME is under the Apache licence.
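
For readers unfamiliar with this kind of detection, here is a minimal sketch of the byte-level checks such a reader typically starts with: look for a byte order mark, then for the encoding attribute of the XML declaration. It uses only the JDK and is not ROME's code. The class name XmlEncodingSniffer and the 1024-byte limit are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of ROME or Nutch. */
class XmlEncodingSniffer {

    // Matches encoding="..." or encoding='...' inside an XML declaration.
    private static final Pattern XML_DECL_ENCODING = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([A-Za-z0-9._:-]+)[\"']");

    /** Returns the encoding name suggested by the leading bytes, or null if none is found. */
    static String sniff(byte[] head) {
        // Step 1: byte order mark (UTF-32 BOMs are ignored here for brevity).
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";

        // Step 2: XML declaration. Decoding as ISO-8859-1 maps every byte to one
        // char, which is enough to read an ASCII-compatible declaration.
        int len = Math.min(head.length, 1024); // assumed limit, declaration comes first
        String prefix = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL_ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : null;
    }

    // True if data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... signature) {
        if (data.length < signature.length) return false;
        for (int i = 0; i < signature.length; i++) {
            if ((data[i] & 0xFF) != signature[i]) return false;
        }
        return true;
    }
}
```

A null result means the document carries no label at this level, so the caller has to fall back on the HTTP header or on a statistical detector.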

> needs 'character encoding' detector
> -----------------------------------
>          Key: NUTCH-25
>          URL:
>      Project: Nutch
>         Type: Wish
>     Reporter: Stefan Groschupf
>     Priority: Trivial

> transferred from:
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> (Content-Type) header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the case
> of XML, we have to use similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
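
To make the "labelled" case in the description above concrete, here is a rough sketch of reading the declared charset from the Content-Type header and, failing that, from an HTML meta tag. It is not Nutch code. The class name DeclaredCharset, the 2048-byte scan limit, and the header-before-meta precedence are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Nutch. */
class DeclaredCharset {

    // charset parameter, as in "text/html; charset=EUC-KR".
    private static final Pattern CHARSET_PARAM = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Any <meta ...> tag; its attributes are searched for a charset parameter.
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Charset named in a Content-Type value, or null if absent. */
    static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Charset named in the first meta tag that declares one, or null. */
    static String fromMetaTag(byte[] head) {
        int len = Math.min(head.length, 2048); // assumed limit on how far to scan
        // ISO-8859-1 maps each byte to one char, so ASCII markup survives intact.
        String prefix = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tags = META_TAG.matcher(prefix);
        while (tags.find()) {
            String charset = fromContentType(tags.group());
            if (charset != null) return charset;
        }
        return null;
    }

    /** Header first, then meta tag (assumed order); null means the document is unlabelled. */
    static String resolve(String contentType, byte[] head) {
        String charset = fromContentType(contentType);
        return charset != null ? charset : fromMetaTag(head);
    }
}
```

When resolve returns null, the document is one of the unlabelled ones the description is concerned with, and a detector based on the byte content would have to take over.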

