tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
Date Thu, 24 May 2018 20:50:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489780#comment-16489780 ]

ASF GitHub Bot commented on TIKA-2520:

kkrugler commented on issue #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237#issuecomment-391855278
   @mbaechler - yes, my bad...the version of LanguageDetector we're using doesn't have state,
so it is in fact immutable. Because it doesn't support incremental text processing, I had to
buffer up text in the call to `addText()` in `OptimaizeLangDetector`, versus actually calling
the detector, which is also why the `hasEnoughText()` call just relies on the length of the
text, versus any in-flight results.
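
   A rough sketch of the buffering approach described above, not the actual `OptimaizeLangDetector` code: `BufferingDetector`, `ENOUGH_TEXT_LENGTH`, and the threshold of 200 characters are hypothetical names and values chosen for illustration. It only shows text being accumulated in `addText()` and `hasEnoughText()` checking the buffered length rather than any partial detection result.

```java
// Hypothetical stand-in; not the Tika or Optimaize API.
final class BufferingDetector {
    // Assumed threshold, chosen only for this example.
    private static final int ENOUGH_TEXT_LENGTH = 200;

    // Text is collected here because the underlying detector
    // is assumed to accept only a complete string.
    private final StringBuilder buffer = new StringBuilder();

    void addText(char[] cbuf, int off, int len) {
        // No detection happens here; the text is only buffered.
        buffer.append(cbuf, off, len);
    }

    boolean hasEnoughText() {
        // Depends on the buffered length alone, not on in-flight results.
        return buffer.length() >= ENOUGH_TEXT_LENGTH;
    }

    String bufferedText() {
        return buffer.toString();
    }

    void reset() {
        buffer.setLength(0);
    }
}
```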
   I'd modified LanguageDetector to get around that (and eventually created [Yalder](https://github.com/kkrugler/yalder)),
but I was conflating that code with what we're currently using.
   So yes, we can safely share the `DEFAULT_DETECTOR` - sorry for the noise.
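
   For illustration, one way a shared detector could be set up so that models are loaded once. This is a sketch under assumptions, not the code in the pull request: `StubDetector` is a hypothetical stand-in for an immutable detector whose `loadModels()` is expensive, and `SharedDetector` and `Holder` are made-up names. It uses the initialization-on-demand holder idiom, where the JVM's class initialization guarantees a single, thread-safe load on first use.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an immutable, expensive-to-build detector.
final class StubDetector {
    // Counts model loads so the single-load behaviour can be observed.
    static final AtomicInteger LOADS = new AtomicInteger();

    StubDetector loadModels() {
        LOADS.incrementAndGet(); // stands in for the costly model load
        return this;
    }

    String detect(String text) {
        // Toy rule for illustration only; no per-call state is kept.
        return text.contains(" the ") ? "en" : "unknown";
    }
}

final class SharedDetector {
    private SharedDetector() {}

    // Holder is initialized on first access to DEFAULT_DETECTOR,
    // exactly once, even when several threads call detect() at once.
    private static final class Holder {
        static final StubDetector DEFAULT_DETECTOR = new StubDetector().loadModels();
    }

    static String detect(String text) {
        return Holder.DEFAULT_DETECTOR.detect(text);
    }
}
```

   Sharing one instance like this is only safe because the detector keeps no per-request state; a detector that buffered text internally would need one instance per request instead.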

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 2h
>  Remaining Estimate: 2h
> Tika REST server's `/language` resource invokes the relatively heavy `loadModels` operation
> for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
> 	LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
> 	String detectedLang = language.getLanguage();
> 	LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
> 	return detectedLang;
> }
> {code}
> This could be optimized by (lazy?) loading the models only once and keeping them in memory.
> I assume the `LanguageDetector` is not thread safe, so I expect this requires an ExecutorService
> with language detectors.

This message was sent by Atlassian JIRA