tika-dev mailing list archives

From "Ken Krugler (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1723) Integrate language-detector into Tika
Date Thu, 03 Sep 2015 18:59:46 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14729588#comment-14729588 ]

Ken Krugler commented on TIKA-1723:
-----------------------------------

Hi Tim,

1. Not sure about "Make language detection configurable via TikaConfig". Doesn't that run
into issues with the classloader, etc.? In any case, I assume that's something [~chrismattmann]
will address in a separate issue about making language detection pluggable.

2. Yes re separate package, without porting current detection code.

3. Yes re making Optimaize the default detector (though this is more about #1 above). So currently
it would be "the only detector", at least for the new API.

4. I think so, though there's a philosophical issue here: should we just have one built-in
implementation, and assume that any others will be separate plug-ins implemented by somebody
else?

5. Yes re getting rid of legacy code in 2.0 (including the current detection code/data and the
ProfilingXXX classes).

> Integrate language-detector into Tika
> -------------------------------------
>
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, TIKA-1723v2.patch
>
>
> The language-detector project at https://github.com/optimaize/language-detector is faster,
> has more languages (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a number of issues
> this raises, especially if [~chrismattmann] moves forward with turning language detection
> into a pluggable extension point.
> I'll add comments with results below.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
