tika-dev mailing list archives

From "Luca Della Toffola (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1149) 12% performance improvement by caching in CompositeParser
Date Mon, 22 Jul 2013 11:40:47 GMT
Luca Della Toffola created TIKA-1149:
----------------------------------------

             Summary: 12% performance improvement by caching in CompositeParser
                 Key: TIKA-1149
                 URL: https://issues.apache.org/jira/browse/TIKA-1149
             Project: Tika
          Issue Type: Improvement
          Components: parser
    Affects Versions: 1.4, 1.3
            Reporter: Luca Della Toffola
            Priority: Minor


We found an easy way to improve Tika's performance. The idea is to avoid recomputing the
parsers map over and over in CompositeParser.getParsers(...) when the context is empty, and
to cache the returned value instead.
This can be done safely even if the media registry and the list of component parsers change
while Tika is executing, by invalidating the cache in that case.
Our attached patch computes the parsers map once per instance of CompositeParser.
The patch checks for the case where the context is empty, and the corresponding setters
invalidate the cache whenever the media registry or the list of component parsers changes.
For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java
class library + Tika app + other apps), the patch reduces the running time from 32 seconds
to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same order of magnitude are also
found for smaller workloads.
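
The caching idea described above can be sketched roughly as follows. This is not the attached patch and does not use Tika's real classes: SketchParser, CachingCompositeSketch, the String-based media types, the Map standing in for the media registry and for the parse context, and the rebuilds counter are all hypothetical names introduced only for illustration. It assumes that an empty context cannot influence which types a parser reports.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a component parser; not Tika's Parser interface. */
interface SketchParser {
    Set<String> supportedTypes(Map<String, Object> context);
}

/** Hypothetical stand-in for CompositeParser, showing only the caching idea. */
class CachingCompositeSketch {
    private List<SketchParser> parsers = new ArrayList<>();
    // Stand-in for the media registry: maps an alias to its canonical type.
    private Map<String, String> registry = new HashMap<>();
    // Cached result for the empty-context case; null means "not computed yet or invalidated".
    private Map<String, SketchParser> cached;
    // Counts full recomputations, only so the effect of the cache can be observed.
    int rebuilds = 0;

    synchronized void setParsers(List<SketchParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        cached = null; // the component parsers changed: invalidate
    }

    synchronized void setRegistry(Map<String, String> registry) {
        this.registry = new HashMap<>(registry);
        cached = null; // the registry changed: invalidate
    }

    synchronized Map<String, SketchParser> getParsers(Map<String, Object> context) {
        if (!context.isEmpty()) {
            // A non-empty context may change the result, so it is never cached.
            return buildMap(context);
        }
        if (cached == null) {
            cached = Collections.unmodifiableMap(buildMap(context));
        }
        return cached;
    }

    private Map<String, SketchParser> buildMap(Map<String, Object> context) {
        rebuilds++;
        Map<String, SketchParser> map = new HashMap<>();
        for (SketchParser parser : parsers) {
            for (String type : parser.supportedTypes(context)) {
                // Normalize through the registry stand-in; unknown types map to themselves.
                map.put(registry.getOrDefault(type, type), parser);
            }
        }
        return map;
    }
}
```

In this sketch, repeated getParsers calls with an empty context reuse one map until either setter runs, while calls with a non-empty context always recompute. The cached map is wrapped as unmodifiable so that a caller cannot corrupt the shared copy; whether the real method needs that protection is an assumption here.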


