tika-dev mailing list archives

From "Tyler Palsulich (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-765) add icu dependency
Date Mon, 02 Mar 2015 01:38:04 GMT

     [ https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich closed TIKA-765.
    Resolution: Won't Fix

Closing as Won't Fix since the Persian character issues seem to be solved.

> add icu dependency
> ------------------
>                 Key: TIKA-765
>                 URL: https://issues.apache.org/jira/browse/TIKA-765
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.10
>            Reporter: Robert Muir
> Spinoff of TIKA-713.
> In PDFBox, reflection is used to detect whether ICU is available on the classpath: if it
> is, PDFBox can use ICU's BiDi support to properly extract right-to-left text. Otherwise,
> the text is returned "backwards". This is because the JDK does not provide the
> functionality needed to do this inverse BiDi reordering / Arabic unshaping.
> It would be nice to properly depend on this, so that these languages work out of the
> box... we do this in Apache Solr's Tika integration (contrib/extraction), for example.
> Unlike the charset detection code from ICU that Tika "includes", including BiDi support
> would be trickier, because it uses data files built from Unicode (these change over time
> and would be a hassle to maintain).
> Additionally, as a note: Tika has some forked charset code from ICU... long term it would
> be great to get those changes into ICU as well.
> Finally, as an optimization, it's possible to reduce the icu4j jar size if needed with
> http://apps.icu-project.org/datacustom/, but maybe as a start we could just depend upon
> the 'whole' icu?
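
For readers unfamiliar with the reflection-based detection described in the quoted issue, here is a minimal sketch of the general technique. It is not PDFBox's actual code: the class name IcuProbe and the method isClassAvailable are hypothetical, chosen only for illustration. The only outside name it relies on is com.ibm.icu.text.Bidi, the ICU4J BiDi class, which is looked up by name so the sketch compiles and runs whether or not icu4j is on the classpath.

```java
public final class IcuProbe {
    // Fully qualified name of ICU4J's BiDi class; looked up by name only,
    // so there is no compile-time dependency on icu4j.
    static final String ICU_BIDI_CLASS = "com.ibm.icu.text.Bidi";

    private IcuProbe() {}

    /**
     * Hypothetical helper: returns true if the named class can be loaded.
     * The class is not initialized, only located.
     */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, IcuProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar, or a jar that is present but cannot be linked.
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(ICU_BIDI_CLASS)) {
            System.out.println("ICU found: right-to-left text can be reordered.");
        } else {
            System.out.println("ICU not found: right-to-left text may come out backwards.");
        }
    }
}
```

With an optional dependency probed this way, behavior silently changes depending on what is on the classpath, which is the motivation given in the issue for declaring icu4j as a proper dependency instead.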

This message was sent by Atlassian JIRA