tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2790) Consider switching lang-detection in tika-eval to open-nlp
Date Fri, 17 May 2019 10:40:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16842086#comment-16842086 ]

Tim Allison commented on TIKA-2790:

The following is an email from [~joern], posted here with permission:
I was able to reproduce what you described above: the text you attached to
this mail is classified as che with the current model.

OpenNLP computes all three-character n-grams for the input
text and then runs a "bag of words" classification on them to detect
the language.
The problem here is that the counts are not taken into account, i.e.
very frequently seen n-grams get the same weight as very rare n-grams.
This seems to degrade accuracy as the text gets longer.

If the code takes the counts into account the above text is classified as fra.
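
To make the difference concrete, here is a minimal sketch of the two feature
styles described above. It is written in Python for brevity and is not
OpenNLP's actual (Java) implementation; the function names char_trigrams,
presence_features and count_features are hypothetical and chosen only for
illustration.

```python
from collections import Counter


def char_trigrams(text):
    """Return every overlapping three-character n-gram in text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def presence_features(text):
    """Bag of n-grams without counts: every distinct trigram gets weight 1."""
    return {gram: 1 for gram in set(char_trigrams(text))}


def count_features(text):
    """Bag of n-grams with counts: every trigram is weighted by its frequency."""
    return dict(Counter(char_trigrams(text)))


sample = "la la la le"
unweighted = presence_features(sample)  # "la " and " le" both weigh 1
weighted = count_features(sample)       # "la " weighs 3, " le" weighs 1
```

In the unweighted version a trigram that occurs once counts exactly as much
as one that occurs hundreds of times, so the longer the text, the more rare
and possibly misleading trigrams can accumulate at full weight.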

This is something we can fix now; we can then make a release of the code
and also release a new lang detect model.

Also, if we work on it now, we should look into the performance issue
you encountered. There are a few things we can do to make it faster.

Here is a link to the code where the counts are ignored:

Also it is important to understand that the lang detect component can
be modified by user code and be optimized for a specific use case.

I suggest we do this:
- implement a new feature generator (this can be done in user code
without making a new release)
- inline the n-gram code (to be more efficient with object creation)
- maybe remove rare n-grams if the input size is very large
- take the counts into account

When we have a better feature generator, we can use it to replace the
one we ship with OpenNLP.
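
The suggestions above could be combined roughly as follows. This is only a
sketch in Python, not OpenNLP code: the function name
trigram_count_features and both thresholds are hypothetical, and a real
feature generator would have to plug into OpenNLP's own Java interfaces.

```python
from collections import Counter

# Hypothetical thresholds, chosen only for illustration.
LARGE_INPUT_CHARS = 10_000
MIN_COUNT_FOR_LARGE_INPUT = 2


def trigram_count_features(text,
                           large_input_chars=LARGE_INPUT_CHARS,
                           min_count=MIN_COUNT_FOR_LARGE_INPUT):
    """Map each character trigram to its count, pruning rare ones on large input."""
    counts = Counter()
    # Count in a single pass over the text instead of first building an
    # intermediate list of n-gram objects.
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    if len(text) >= large_input_chars:
        # For very large inputs, drop trigrams seen fewer than min_count times.
        counts = Counter({g: c for g, c in counts.items() if c >= min_count})
    return dict(counts)
```

Short inputs keep every trigram, since there is little evidence to spare;
only inputs at or above the size threshold have their rare trigrams removed.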


> Consider switching lang-detection in tika-eval to open-nlp
> ----------------------------------------------------------
>                 Key: TIKA-2790
>                 URL: https://issues.apache.org/jira/browse/TIKA-2790
>             Project: Tika
>          Issue Type: Improvement
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: fra_mixed_100000_0.0_0.txt, langid_20190509.zip, langid_20190510.zip, langid_20190514.zip, langid_20190514_plus_minus_1.zip

This message was sent by Atlassian JIRA
