tika-dev mailing list archives

From "Andrzej Bialecki (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1568) AutoDetectReader performance problem
Date Mon, 09 Mar 2015 17:02:38 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated TIKA-1568:
------------------------------------
    Description: 
Parsing many text files is slow because of repeated calls to ServiceLoader.loadServiceProviders(EncodingDetector.class).
This happens in TXTParser, HTMLParser and SourceCodeParser. In most cases, when Tika is using
the default ServiceLoader instance created in the Parser's static section, this cost can be
avoided by caching the resulting List<EncodingDetector> at a higher level in the Parser
(as a static property). When custom ServiceLoaders are used, the same effect can be achieved by
putting this list in ParseContext, or by caching these lists at a lower level in the
ServiceLoader component.
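
The static-property option for the default loader could look roughly like the sketch below. It is not Tika code: Detector, SlowLoader, CachedDetectors and CachedDetectorsDemo are hypothetical stand-ins (for EncodingDetector, the expensive provider lookup, and a parser-level cache), kept free of Tika dependencies so the example compiles on its own. It uses the initialization-on-demand holder idiom so the lookup runs once per JVM, on first use.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for Tika's EncodingDetector. */
interface Detector {
    String name();
}

/** Hypothetical stand-in for the expensive provider lookup (a classpath scan in practice). */
final class SlowLoader {
    static final AtomicInteger SCANS = new AtomicInteger();

    static List<Detector> loadDetectors() {
        SCANS.incrementAndGet(); // each call models one full scan
        List<Detector> found = new ArrayList<>();
        found.add(() -> "html");
        found.add(() -> "universal");
        return found;
    }
}

/** Caches the detector list once per JVM (initialization-on-demand holder idiom). */
final class CachedDetectors {
    private CachedDetectors() {}

    private static final class Holder {
        // Runs exactly once, when Holder is first initialized; the JVM makes this thread-safe.
        static final List<Detector> DETECTORS =
                Collections.unmodifiableList(SlowLoader.loadDetectors());
    }

    static List<Detector> get() {
        return Holder.DETECTORS;
    }
}

class CachedDetectorsDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) {
            CachedDetectors.get(); // models 1000 parse calls sharing one cached list
        }
        System.out.println("scans: " + SlowLoader.SCANS.get()); // prints "scans: 1"
    }
}
```

A static cache like this only suits the default loader; with custom ServiceLoaders the list would have to be cached per loader instead, as the description notes.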

The relevant part of the stack trace follows:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
	- locked <0x00000007909d2e48> (a java.util.jar.JarFile)
	at java.util.jar.JarFile.getEntry(JarFile.java:227)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
	at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
	at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
	at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
	at java.util.Collections.list(Collections.java:3687)
	at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
	at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
	at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
	at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
	at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
	at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
{code}

  was:
Performance of parsing many plain text files suffers from repeated calls to ServiceLoader.loadServiceProviders(EncodingDetector.class).
In most cases, when Tika is using the default ServiceLoader instance created in TXTParser,
this cost can be avoided by caching the resulting List<EncodingDetector> either at a
higher level in TXTParser (e.g. by putting it in ParsingContext) or at a lower level in ServiceLoader.

Relevant part of  the stacktrace follows:
{code}
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.zip.ZipFile.getEntry(ZipFile.java:304)
	- locked <0x00000007909d2e48> (a java.util.jar.JarFile)
	at java.util.jar.JarFile.getEntry(JarFile.java:227)
	at java.util.jar.JarFile.getJarEntry(JarFile.java:210)
	at sun.misc.URLClassPath$JarLoader.getResource(URLClassPath.java:840)
	at sun.misc.URLClassPath$JarLoader.findResource(URLClassPath.java:818)
	at sun.misc.URLClassPath$1.next(URLClassPath.java:226)
	at sun.misc.URLClassPath$1.hasMoreElements(URLClassPath.java:236)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:583)
	at java.net.URLClassLoader$3$1.run(URLClassLoader.java:581)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader$3.next(URLClassLoader.java:580)
	at java.net.URLClassLoader$3.hasMoreElements(URLClassLoader.java:605)
	at java.util.Collections.list(Collections.java:3687)
	at org.eclipse.jetty.webapp.WebAppClassLoader.toList(WebAppClassLoader.java:337)
	at org.eclipse.jetty.webapp.WebAppClassLoader.getResources(WebAppClassLoader.java:321)
	at org.apache.tika.config.ServiceLoader.findServiceResources(ServiceLoader.java:210)
	at org.apache.tika.config.ServiceLoader.identifyStaticServiceProviders(ServiceLoader.java:277)
	at org.apache.tika.config.ServiceLoader.loadStaticServiceProviders(ServiceLoader.java:306)
	at org.apache.tika.config.ServiceLoader.loadServiceProviders(ServiceLoader.java:228)
	at org.apache.tika.detect.AutoDetectReader.<init>(AutoDetectReader.java:104)
	at org.apache.tika.parser.txt.TXTParser.parse(TXTParser.java:70)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
	at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
...
{code}

        Summary: AutoDetectReader performance problem  (was: TXTParser performance problem)

> AutoDetectReader performance problem
> ------------------------------------
>
>                 Key: TIKA-1568
>                 URL: https://issues.apache.org/jira/browse/TIKA-1568
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Andrzej Bialecki 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
