tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2520) OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
Date Thu, 24 May 2018 20:55:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489788#comment-16489788 ]

ASF GitHub Bot commented on TIKA-2520:
--------------------------------------

chrismattmann closed pull request #237: TIKA-2520 optimize OptimaizeLangDetector default loadModel()
URL: https://github.com/apache/tika/pull/237
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/tika-langdetect/src/main/java/org/apache/tika/langdetect/OptimaizeLangDetector.java b/tika-langdetect/src/main/java/org/apache/tika/langdetect/OptimaizeLangDetector.java
index d31559238..585b74819 100644
--- a/tika-langdetect/src/main/java/org/apache/tika/langdetect/OptimaizeLangDetector.java
+++ b/tika-langdetect/src/main/java/org/apache/tika/langdetect/OptimaizeLangDetector.java
@@ -30,6 +30,8 @@
 import org.apache.tika.language.detect.LanguageNames;
 import org.apache.tika.language.detect.LanguageResult;
 
+import com.google.common.collect.ImmutableList;
+import com.google.common.collect.ImmutableSet;
 import com.optimaize.langdetect.DetectedLanguage;
 import com.optimaize.langdetect.LanguageDetectorBuilder;
 import com.optimaize.langdetect.i18n.LdLocale;
@@ -44,6 +46,27 @@
  */
 public class OptimaizeLangDetector extends LanguageDetector {
 
+	private static final List<LanguageProfile> DEFAULT_LANGUAGE_PROFILES;
+	private static final ImmutableSet<String> DEFAULT_LANGUAGES;
+	private static final com.optimaize.langdetect.LanguageDetector DEFAULT_DETECTOR;
+
+
+	static {
+		try {
+			DEFAULT_LANGUAGE_PROFILES = ImmutableList.copyOf(new LanguageProfileReader().readAllBuiltIn());
+
+			ImmutableSet.Builder<String> builder = new ImmutableSet.Builder<>();
+			for (LanguageProfile profile : DEFAULT_LANGUAGE_PROFILES) {
+				builder.add(makeLanguageName(profile.getLocale()));
+			}
+			DEFAULT_LANGUAGES = builder.build();
+
+			DEFAULT_DETECTOR = createDetector(DEFAULT_LANGUAGE_PROFILES, null);
+		} catch (IOException e) {
+			throw new RuntimeException("can't initialize OptimaizeLangDetector");
+		}
+	}
+
 	private static final int MAX_CHARS_FOR_DETECTION = 20000;
 	private static final int MAX_CHARS_FOR_SHORT_DETECTION = 200;
 	
@@ -51,7 +74,7 @@
 	private CharArrayWriter writer;
 	private Set<String> languages;
 	private Map<String, Float> languageProbabilities;
-	
+
 	public OptimaizeLangDetector() {
 		super();
 		
@@ -59,24 +82,23 @@ public OptimaizeLangDetector() {
 	}
 	
 	@Override
-	public LanguageDetector loadModels() throws IOException {
-		List<LanguageProfile> languageProfiles = new LanguageProfileReader().readAllBuiltIn();
-		
+	public LanguageDetector loadModels() {
 		// FUTURE when the "language-detector" project supports short profiles, check if
 		// isShortText() returns true and switch to those.
-		
-		languages = new HashSet<>();
-		for (LanguageProfile profile : languageProfiles) {
-			languages.add(makeLanguageName(profile.getLocale()));
+
+		languages = DEFAULT_LANGUAGES;
+
+		if (languageProbabilities != null) {
+			detector = createDetector(DEFAULT_LANGUAGE_PROFILES, languageProbabilities);
+		} else {
+			detector = DEFAULT_DETECTOR;
 		}
-		
-		detector = createDetector(languageProfiles);
-		
+
 		return this;
 
 	}
 
-	private String makeLanguageName(LdLocale locale) {
+	private static String makeLanguageName(LdLocale locale) {
 		return LanguageNames.makeName(locale.getLanguage(), locale.getScript().orNull(), locale.getRegion().orNull());
 	}
 
@@ -98,12 +120,12 @@ public LanguageDetector loadModels(Set<String> languages) throws IOException {
 			}
 		}
 		
-		detector = createDetector(new LanguageProfileReader().readBuiltIn(locales));
+		detector = createDetector(new LanguageProfileReader().readBuiltIn(locales), languageProbabilities);
 		
 		return this;
 	}
 
-	private com.optimaize.langdetect.LanguageDetector createDetector(List<LanguageProfile> languageProfiles) {
+	private static com.optimaize.langdetect.LanguageDetector createDetector(List<LanguageProfile> languageProfiles, Map<String, Float> languageProbabilities) {
 		// FUTURE currently the short text algorithm doesn't normalize probabilities until the end, which
 		// means you can often get 0 probabilities. So we pick a very short length for this limit.
 		LanguageDetectorBuilder builder = LanguageDetectorBuilder.create(NgramExtractors.standard())


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> OptimaizeLangDetector#loadModels() should not be called for every single langdetect HTTP request
> ------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-2520
>                 URL: https://issues.apache.org/jira/browse/TIKA-2520
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.16
>            Reporter: Vincent van Donselaar
>            Priority: Minor
>              Labels: performance
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> Tika REST server's `/language` resource invokes the relatively heavy `loadModels` operation for every language detect call:
> {code:title=LanguageResource.java}
> public String detect(final String string) throws IOException {
> 	LanguageResult language = new OptimaizeLangDetector().loadModels().detect(string);
> 	String detectedLang = language.getLanguage();
> 	LOG.info("Detecting language for incoming resource: [{}]", detectedLang);
> 	return detectedLang;
> }
> {code}
> This could be optimized by (lazy?) loading the models only once and keeping them in memory. I assume the `LanguageDetector` is not thread-safe, so I expect this requires an ExecutorService with language detectors.
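
The "load once, keep in memory" idea from the description can be sketched without any Tika classes. This is a minimal illustration, not the merged change: `LoadOnceSketch`, `loadProfiles()`, `Holder` and `Detector` are hypothetical names, and the string list merely stands in for the real language profiles. It uses the initialization-on-demand holder idiom, which defers the expensive load until first use and relies on the JVM's class initialization guarantees to run it exactly once. It assumes the shared data is immutable and therefore safe to read from many request threads.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadOnceSketch {

    // Counts how often the expensive load runs, only to make the effect observable.
    static final AtomicInteger LOAD_COUNT = new AtomicInteger();

    // Hypothetical stand-in for an expensive model load.
    static List<String> loadProfiles() {
        LOAD_COUNT.incrementAndGet();
        return Collections.unmodifiableList(Arrays.asList("en", "nl", "de"));
    }

    // Initialization-on-demand holder: the JVM runs this initializer once,
    // thread-safely, the first time Holder is referenced.
    private static final class Holder {
        static final List<String> PROFILES = loadProfiles();
    }

    // Hypothetical per-request object: cheap to construct because it only
    // references the shared, immutable profiles.
    static final class Detector {
        private final List<String> profiles = Holder.PROFILES;

        boolean supports(String language) {
            return profiles.contains(language);
        }
    }

    public static void main(String[] args) {
        // Simulate many requests, each creating its own detector.
        for (int i = 0; i < 1000; i++) {
            new Detector().supports("nl");
        }
        System.out.println("loads: " + LOAD_COUNT.get());
    }
}
```

Whether a pool of detectors behind an ExecutorService is still needed depends on whether the detector itself is safe to share across threads. The sketch only covers sharing immutable model data and makes no claim about the underlying library's thread-safety.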



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
