tika-dev mailing list archives

From Shabanali Faghani <shabanali.fagh...@gmail.com>
Subject A more accurate facility for detecting Charset Encoding of HTML documents
Date Thu, 21 Jul 2016 22:31:59 GMT
Hi all,

I've developed a Java library for detecting the charset encoding of HTML
documents. My tests show that it is much more accurate than all the
existing tools in this area, including icu4j, jchardet, juniversalchardet,
cpdetector, lucene-icu4j, and also TikaEncodingDetector.

Below, I've provided some links related to my library:
Code on github: https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central:
http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

Please let me know what you think about adding this tool to the detect
package of Tika as another class, say HTMLEncodingDetector, implementing
the EncodingDetector [1] interface. Or it may even be a better idea to
create another module, say tika-encodingdetect, and put
HTMLEncodingDetector and other related classes in it with its own POM,
just like the tika-langdetect module [2].
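
To make the first option concrete, here is a rough sketch of the shape
such a class could take. It is only an illustration under several
assumptions: SimpleEncodingDetector is a hypothetical stand-in for Tika's
EncodingDetector so the snippet compiles without Tika on the classpath
(the real detect method also takes a Metadata argument), and the body is a
placeholder that only sniffs a meta charset declaration. It is not the
IUST-HTMLCharDet algorithm and does not call that library's API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for org.apache.tika.detect.EncodingDetector.
// The real interface method is detect(InputStream, Metadata).
interface SimpleEncodingDetector {
    Charset detect(InputStream input) throws IOException;
}

// Class name taken from the proposal; the logic is a placeholder only.
class HTMLEncodingDetector implements SimpleEncodingDetector {
    // How many leading bytes to inspect (an arbitrary choice for this sketch).
    private static final int LOOKAHEAD = 8192;

    // Matches both <meta charset="..."> and the http-equiv content form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_:.\\-]*)",
            Pattern.CASE_INSENSITIVE);

    @Override
    public Charset detect(InputStream input) throws IOException {
        if (input == null || !input.markSupported()) {
            return null; // cannot peek without mark/reset support
        }
        input.mark(LOOKAHEAD);
        try {
            byte[] buf = new byte[LOOKAHEAD];
            int n = 0;
            int r;
            while (n < buf.length
                    && (r = input.read(buf, n, buf.length - n)) != -1) {
                n += r;
            }
            // ISO-8859-1 maps every byte to a char, so ASCII markup survives.
            String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
            Matcher m = META_CHARSET.matcher(head);
            if (m.find() && Charset.isSupported(m.group(1))) {
                return Charset.forName(m.group(1));
            }
            return null; // no usable declaration found
        } finally {
            input.reset(); // leave the stream where the caller had it
        }
    }
}
```

Returning null when nothing is detected and resetting the stream before
returning would let such a class sit next to the other detectors without
consuming the input they also need to read.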

Hope that helps Tika!
-------------

From Chris Mattmann in private contact:
> Thanks, sure please open up a PR
> http://github.com/apache/tika/#contributing-via-github
> and a discussion on dev@tika.a.o and would be happy to proceed.

@Chris
To open a PR, I've also created an issue in JIRA: TIKA-2038 [3].

Thanks,
Shabanali

[1] http://grepcode.com/file/repo1.maven.org/maven2/org.apache.tika/tika-core/1.9/org/apache/tika/detect/EncodingDetector.java?av=f
    or
    https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/detect/EncodingDetector.java
[2] https://github.com/apache/tika/tree/master/tika-langdetect
[3] https://issues.apache.org/jira/browse/TIKA-2038
