nutch-dev mailing list archives

From "Stefan Groschupf (JIRA)"
Subject [jira] Created: (NUTCH-25) needs 'character encoding' detector
Date Sat, 26 Mar 2005 14:55:25 GMT
needs 'character encoding' detector

         Key: NUTCH-25
     Project: Nutch
        Type: Wish
    Reporter: Stefan Groschupf
    Priority: Trivial

transferred from:
submitted by:
Jungshik Shin

This is a follow-up to bug 993380 (figure out 'charset'
from the meta tag).

Although we can cover a lot of ground using the
Content-Type ('C-T') header field in the HTTP header
and the corresponding meta tag in HTML documents (XML
needs a similar but separate 'parsing'), there are a
lot of documents in the wild without any information
about the character encoding used. Browsers like
Mozilla and search engines like Google use character
encoding detectors to deal with these 'unlabelled'
documents.
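
To make the 'labelled' case concrete, here is a minimal
sketch in Java of looking for a declared charset, first in
the Content-Type header value and then in a meta tag. The
class and method names (DeclaredCharset, fromContentType,
fromMetaTag, resolve) are hypothetical and not part of
Nutch, and the regular expressions are a rough stand-in
for real HTML parsing. A null result marks the document as
'unlabelled', which is where a detector would take over.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not Nutch code: finds a *declared* charset only.
class DeclaredCharset {
    // Matches charset=value, with or without quotes around the value.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Rough match for a <meta ...> tag; an assumption, not a real HTML parser.
    private static final Pattern META = Pattern.compile(
        "<meta\\s[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in a Content-Type header value, or null.
    static String fromContentType(String headerValue) {
        if (headerValue == null) return null;
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    // Returns the charset named in the first meta tag that declares one, or null.
    static String fromMetaTag(String htmlPrefix) {
        if (htmlPrefix == null) return null;
        Matcher meta = META.matcher(htmlPrefix);
        while (meta.find()) {
            Matcher m = CHARSET.matcher(meta.group());
            if (m.find()) return m.group(1).toUpperCase(Locale.ROOT);
        }
        return null;
    }

    // Header first, then meta tag; null means the document is unlabelled.
    static String resolve(String contentTypeHeader, String htmlPrefix) {
        String fromHeader = fromContentType(contentTypeHeader);
        return fromHeader != null ? fromHeader : fromMetaTag(htmlPrefix);
    }
}
```

Letting the header win over the meta tag is a choice made
for this sketch; the order could just as well be
configurable.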

Mozilla's character encoding detector is GPL/MPL'd and
we might be able to port it to Java. Unfortunately,
it's not fool-proof. However, combined with some other
heuristics used by Mozilla and elsewhere, it should be
possible to achieve a high detection rate.
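
As an illustration of the kind of cheap heuristic that
could sit next to a real detector, here is a minimal Java
sketch. It is not Mozilla's algorithm and not a port of
it: it only checks for a byte order mark and then tests
whether the bytes decode as strict UTF-8. The class name
EncodingGuess and the caller-supplied fallback are
assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Nutch or Mozilla code.
class EncodingGuess {
    // Returns a charset name guessed from the raw bytes, or the fallback.
    static String guess(byte[] data, String fallback) {
        // 1. A byte order mark is the strongest hint available.
        //    Simplification: FF FE is treated as UTF-16LE, UTF-32 is ignored.
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        // 2. Legacy 8-bit text with non-ASCII bytes rarely forms valid UTF-8.
        //    Pure ASCII also passes this test and is reported as UTF-8.
        if (isValidUtf8(data)) return "UTF-8";
        // 3. Otherwise give up and let the caller's default apply.
        return fallback;
    }

    // Strict decode: any malformed sequence, including one cut off at the
    // end of the buffer, makes this return false.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

This cannot tell legacy encodings apart (for example
EUC-KR from Shift_JIS), which is where a statistical
detector would be needed.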

The following page has links to some other related pages.

In addition to character encoding detection, we also
need to detect the language of a document, which is
even harder and should be a separate bug (although
it's related).

This message is automatically generated by JIRA.
