tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
Date Thu, 24 May 2018 22:35:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489927#comment-16489927 ]

ASF GitHub Bot commented on TIKA-2520:
--------------------------------------

chrismattmann commented on issue #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237#issuecomment-391886526
 
 
   I also merged this into 2.x-master:
   ```
   [INFO] Installing /Users/mattmann/tmp/tika2.0.0/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/2.0.0-SNAPSHOT/tika-2.0.0-SNAPSHOT.pom
   [INFO] ------------------------------------------------------------------------
   [INFO] Reactor Summary:
   [INFO] 
   [INFO] Apache Tika parent ................................. SUCCESS [  1.977 s]
   [INFO] Apache Tika core ................................... SUCCESS [ 30.959 s]
   [INFO] Apache Tika parsers ................................ SUCCESS [03:25 min]
   [INFO] Apache Tika XMP .................................... SUCCESS [  2.420 s]
   [INFO] Apache Tika serialization .......................... SUCCESS [  1.955 s]
   [INFO] Apache Tika batch .................................. SUCCESS [01:58 min]
   [INFO] Apache Tika language detection ..................... SUCCESS [  2.731 s]
   [INFO] Apache Tika application ............................ SUCCESS [01:07 min]
   [INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 31.078 s]
   [INFO] Apache Tika translate .............................. SUCCESS [  3.269 s]
   [INFO] Apache Tika server ................................. SUCCESS [ 21.436 s]
   [INFO] Apache Tika examples ............................... SUCCESS [ 15.475 s]
   [INFO] Apache Tika Java-7 Components ...................... SUCCESS [  3.467 s]
   [INFO] Apache Tika eval ................................... SUCCESS [ 40.324 s]
   [INFO] Apache Tika Deep Learning (powered by DL4J) ........ SUCCESS [01:02 min]
   [INFO] Apache Tika Natural Language Processing ............ SUCCESS [ 25.107 s]
   [INFO] Apache Tika ........................................ SUCCESS [  0.030 s]
   [INFO] ------------------------------------------------------------------------
   [INFO] BUILD SUCCESS
   [INFO] ------------------------------------------------------------------------
   [INFO] Total time: 10:34 min
   [INFO] Finished at: 2018-05-24T14:30:18-07:00
   [INFO] Final Memory: 203M/1743M
   [INFO] ------------------------------------------------------------------------
   nonas:tika2.0.0 mattmann$ git push -u origin master
   Counting objects: 11, done.
   Delta compression using up to 4 threads.
   Compressing objects: 100% (7/7), done.
   Writing objects: 100% (11/11), 1.38 KiB | 1.38 MiB/s, done.
   Total 11 (delta 3), reused 0 (delta 0)
   remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
   To github.com:/apache/tika.git
      e24e6afb1..5c1143b30  master -> master
   Branch 'master' set up to track remote branch 'master' from 'origin'.
   nonas:tika2.0.0 mattmann$ 
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Assignee: Chris A. Mattmann
>            Priority: Minor
>              Labels: performance
>             Fix For: 1.19
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
> 	LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
> 	String detectedLang = language.getLanguage();
> 	LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
> 	return detectedLang;
> }
> {code}
> This could be optimized by (lazily?) loading the models only once and keeping them in memory. I assume the `LanguageDetector` is not thread-safe, so I expect this requires an ExecutorService with language detectors.
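
For illustration only, the "load once, keep in memory" idea from the description could be sketched as below. This is not the change merged in PR #237. `StubDetector` is a hypothetical stand-in for `OptimaizeLangDetector` so that the example runs without Tika on the classpath, and the per-thread cache is one assumed way to sidestep the thread-safety question without an ExecutorService.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CachedLanguageResource {

    /** Hypothetical stand-in for a detector with an expensive loadModels() step. */
    static final class StubDetector {
        // Counts how often the "heavy" load step runs, for demonstration only.
        static final AtomicInteger LOAD_COUNT = new AtomicInteger();

        StubDetector loadModels() {
            LOAD_COUNT.incrementAndGet(); // pretend this reads large model files
            return this;
        }

        String detect(String text) {
            // Toy heuristic, only here to make the sketch runnable.
            return text.contains("bonjour") ? "fr" : "en";
        }
    }

    // One detector per thread, created and loaded lazily on that thread's first call.
    // Assumption: a detector instance must not be shared between threads.
    private static final ThreadLocal<StubDetector> DETECTOR =
            ThreadLocal.withInitial(() -> new StubDetector().loadModels());

    public String detect(String text) {
        // Reuses the already loaded detector instead of calling loadModels() per request.
        return DETECTOR.get().detect(text);
    }

    public static void main(String[] args) {
        CachedLanguageResource resource = new CachedLanguageResource();
        System.out.println(resource.detect("hello world"));
        System.out.println(resource.detect("bonjour le monde"));
        System.out.println("loadModels() calls: " + StubDetector.LOAD_COUNT.get());
    }
}
```

With this shape, each server worker thread pays the model-loading cost once. If the real detector turns out to be safe to share, a single static instance would use less memory than one copy per thread.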



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
