tika-dev mailing list archives

From "Piotr B. (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-344) Charset hint in metadata
Date Mon, 07 Dec 2009 12:46:18 GMT
Charset hint in metadata

                 Key: TIKA-344
                 URL: https://issues.apache.org/jira/browse/TIKA-344
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 0.6
            Reporter: Piotr B.
            Priority: Minor

It would be nice if TextParser and HtmlParser supported the Metadata.CONTENT_ENCODING hint.

In my application I always prefer that hint (if it is present) over the charset detector result,
because the charset detector is often wrong on short inputs (even if match.confidence is 100),
and I know that the hint, if present, is right in 99% of cases.

To be more general, the user could change the default behaviour by overriding a function
 F(hint, detectorResults) -> charset.
Another solution is to provide some standard strategies and let the user choose one of them:
a) the hint is most important
b) the charset detector result is most important
c) a heuristic using detectorResult.confidence, the hint and maybe the input length
Maybe the last heuristic method would be good enough for most cases.
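
A rough sketch of how the function F and the three strategies above could be expressed in Java. Everything here is hypothetical and not existing Tika API: the CharsetStrategy interface, the CharsetStrategies class, and the minConfidence and minLength thresholds are assumptions made only for illustration. A null argument stands for "no hint" or "no detector result".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical hook for F(hint, detectorResults) -> charset; not part of Tika. */
interface CharsetStrategy {
    /**
     * @param hint        charset taken from the metadata hint, or null if absent
     * @param detected    best guess of the charset detector, or null if none
     * @param confidence  detector confidence for that guess, 0..100
     * @param inputLength number of input bytes the detector looked at
     */
    Charset choose(Charset hint, Charset detected, int confidence, long inputLength);
}

/** Hypothetical set of standard strategies matching options a), b) and c). */
final class CharsetStrategies {
    private CharsetStrategies() {}

    // a) the hint is most important; fall back to the detector only without a hint
    static final CharsetStrategy PREFER_HINT =
        (hint, detected, confidence, inputLength) -> hint != null ? hint : detected;

    // b) the detector result is most important; fall back to the hint otherwise
    static final CharsetStrategy PREFER_DETECTOR =
        (hint, detected, confidence, inputLength) -> detected != null ? detected : hint;

    // c) trust the detector only when it is confident AND the input is long enough;
    //    both thresholds are illustrative assumptions, not tuned values
    static CharsetStrategy heuristic(int minConfidence, long minLength) {
        return (hint, detected, confidence, inputLength) -> {
            if (hint == null) return detected;
            if (detected == null) return hint;
            boolean trustDetector = confidence >= minConfidence && inputLength >= minLength;
            return trustDetector ? detected : hint;
        };
    }
}
```

With something like this, a parser could take a CharsetStrategy as configuration. For example, CharsetStrategies.heuristic(90, 200) would keep the hint for a 20-byte input even when the detector reports confidence 100, which is the short-input case described above.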

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
