tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1723) Integrate language-detector into Tika
Date Thu, 04 Feb 2016 17:15:40 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132613#comment-15132613 ]

Tim Allison commented on TIKA-1723:

Agreed on the ease of building the new ld framework in 2.0.  

Given Mike's comparison of Tika and langdetect [here|http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html],
even though it is now dated, I'd be willing to mothball our language detector in 2.x
(i.e. leave it in 1.x, and resurrect it if we need to).  That said, I didn't write
that code, and I know that [~toke] on TIKA-1549 has since dramatically improved our speed.

This is certainly a large enough issue to invite feedback from the entire community.  Do we
want to drop our language detection code in 2.x?  Or is there a good reason to keep it?

> Integrate language-detector into Tika
> -------------------------------------
>                 Key: TIKA-1723
>                 URL: https://issues.apache.org/jira/browse/TIKA-1723
>             Project: Tika
>          Issue Type: Improvement
>          Components: languageidentifier
>    Affects Versions: 1.11
>            Reporter: Ken Krugler
>            Assignee: Ken Krugler
>            Priority: Minor
>         Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, TIKA-1723v2.patch
> The language-detector project at https://github.com/optimaize/language-detector is faster,
has more languages (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a number of issues
this raises, especially if [~chrismattmann] moves forward with turning language detection
into a pluggable extension point.
> I'll add comments with results below.

This message was sent by Atlassian JIRA