tika-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Fri, 22 Jul 2016 13:31:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15389493#comment-15389493
] 

Tim Allison commented on TIKA-2038:
-----------------------------------

For integration into Tika, I see two options:

1) we could add your library in maven central as a dependency.  Downside: we'd then be relying
on icu4j (not a small addition). Upside: it would make for cleaner code on our part. You'd
get better credit.

2) you could work with us on a PR to add your logic and some of your code into Tika.  This
would allow us to reuse our copy/paste of a few classes from icu4j without the entire library.

[~faghani] and fellow devs, preferences?  My preference would be 2).

[~faghani], do you mind if we add your html test set to our regular testing corpus?

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>
> Currently, Tika uses icu4j for detecting charset encoding of HTML documents as well as
the other naturally text documents. But the accuracy of encoding detector tools, including
icu4j, in dealing with the HTML documents is meaningfully less than from which the other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within some of other Apache stuffs such as Nutch,
Lucene, Solr, etc. and these projects are strongly in connection with the HTML documents,
it seems that having such an facility in Tika also will help them to become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message