tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents
Date Fri, 12 Aug 2016 13:52:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15418855#comment-15418855 ]

Tim Allison edited comment on TIKA-2038 at 8/12/16 1:51 PM:
------------------------------------------------------------

bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to sample urls from Common
Crawl based on country codes in the urls.  I can take care of this in a few weeks.
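
As a rough illustration of sampling by country code, here is a minimal sketch. It assumes the country code is the two-letter final label of the URL's host name, and the function names (country_code, sample_by_country) are hypothetical; it does not read Common Crawl data and is not the sampling code that was eventually used.

```python
import random
from collections import defaultdict
from urllib.parse import urlparse


def country_code(url):
    """Return the two-letter final host label (e.g. 'de'), or None."""
    host = urlparse(url).hostname or ""
    label = host.rsplit(".", 1)[-1].lower()
    # Assumption: any two-letter alphabetic final label is a country code.
    return label if len(label) == 2 and label.isalpha() else None


def sample_by_country(urls, per_country, seed=0):
    """Group URLs by country code and draw up to per_country from each group."""
    groups = defaultdict(list)
    for url in urls:
        cc = country_code(url)
        if cc is not None:
            groups[cc].append(url)
    rng = random.Random(seed)  # fixed seed so a sample can be reproduced
    return {
        cc: rng.sample(group, min(per_country, len(group)))
        for cc, group in sorted(groups.items())
    }
```

URLs whose host does not end in a two-letter label (for example .com or .org) are dropped here; whether to keep them as a separate bucket is a design choice this sketch leaves open.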

bq. Please send me your markup stripper so I can use it in my code to evaluate both your stripper
and proposed algorithm.
I'll post that today.


bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some work, but it
evaluates the output of two runs of Tika and reports on differences in number of exceptions,
mime detection diffs, content diff, etc.  I was hoping to have time to get this ready for
1.14, but 1.15 is looking more likely.
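
To make the kind of comparison described above concrete, here is a minimal sketch. It is not the tika-eval code: the record layout (a dict per file with "mime", "content", and "exception" keys) and the function name compare_runs are assumptions made only for illustration.

```python
def compare_runs(run_a, run_b):
    """Compare two runs, each a dict mapping file name to a result record.

    Hypothetical record layout:
        {"mime": str, "content": str, "exception": str or None}
    """
    # Only files present in both runs can be compared field by field.
    shared = sorted(set(run_a) & set(run_b))
    return {
        "exceptions_a": sum(1 for r in run_a.values() if r["exception"]),
        "exceptions_b": sum(1 for r in run_b.values() if r["exception"]),
        "mime_diffs": [f for f in shared if run_a[f]["mime"] != run_b[f]["mime"]],
        "content_diffs": [f for f in shared if run_a[f]["content"] != run_b[f]["content"]],
    }
```

A real comparison would likely use a softer content measure than exact string equality (token overlap, for instance), since small whitespace changes would otherwise count as differences.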



was (Author: tallison@mitre.org):
bq.  But since I haven’t access to a broadband Internet connection

Oh, ok. I've been thinking about this a bit more.  I think I'd like to do sample urls from
Common Crawl based on country codes in the urls.  I can take care of this in a few weeks.

bq. Please send me your markup stripper so I can use it in my code to evaluate both your stripper
and proposed algorithm.
I'll post that today.


bq. BTW, what is tika-eval code?

Code [here|https://github.com/tballison/tika/tree/TIKA-1302] still needs some work, but it
evaluates the output of two runs of Tika and reports on differences in number of exceptions,
mime detection diffs, content diff, etc.  I was hoping to have time to get this ready for
1.14, but 1.15 is looking more likely.


> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip,
tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents as well as
other text-based documents. But the accuracy of encoding detection tools, including
icu4j, is meaningfully lower for HTML documents than for other text
documents. Hence, in our project I developed a library that works pretty well for HTML documents,
which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
Lucene, Solr, etc., and these projects deal heavily with HTML documents,
it seems that having such a facility in Tika would also help them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
