tika-dev mailing list archives

From "Luca Della Toffola (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1149) 12% performance improvement by caching in CompositeParser
Date Tue, 23 Jul 2013 14:58:49 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716454#comment-13716454 ]

Luca Della Toffola commented on TIKA-1149:

I took a deeper look at what you suggested.
It seems to me (at least with my limited knowledge of Tika's codebase) that there is no easy or clean
way to gain a meaningful amount of performance (> 10%) by refactoring {{CompositeParser.getParser(Metadata,
ParseContext)}}. Using the full type->parser map seems to be the cleanest way to go.

The alternative, if I understood correctly, is to add a method to {{DefaultParser}} that builds
a (new) list of parsers from the content of {{CompositeParser.parsers}} and the dynamic
lookup mechanism in {{ServiceLoader}}.
Searching for the appropriate parser would then look similar to the current {{CompositeParser.getParsers(ParseContext)}}.
Instead of building the full type->parser map each time, we would search the supported types
returned by each parser in the (new, combined) parsers list. A quick test using this strategy
showed only a 1.85% speedup. Would that be a feasible solution for you?
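
To make the list-based alternative concrete, here is a minimal sketch of the lookup. It does not use Tika's classes: {{ListedParser}} and {{ListLookup}} are hypothetical stand-ins, media types are plain strings, and the "last matching parser wins" rule is an assumption chosen so that the result matches what repeated map insertions would give.

```java
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a component parser; NOT Tika's Parser interface.
interface ListedParser {
    Set<String> supportedTypes();
}

// Hypothetical helper illustrating a search over the parsers list
// instead of building a full type->parser map on every call.
class ListLookup {
    // Scans every parser and asks whether it supports the requested type.
    // Assumption for this sketch: a later parser in the list overrides an
    // earlier one, as a series of map.put(type, parser) calls would.
    static ListedParser find(List<ListedParser> parsers, String type) {
        ListedParser match = null;
        for (ListedParser p : parsers) {
            if (p.supportedTypes().contains(type)) {
                match = p;
            }
        }
        return match; // null when no parser supports the type
    }
}
```

Each lookup still has to visit every parser and query its supported types, which may be one reason the measured gain stayed small.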

> 12% performance improvement by caching in CompositeParser
> ---------------------------------------------------------
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: CompositeParser.patch, ParseContext.patch
> We found an easy way to improve Tika's performance. The idea is to avoid recomputing the parsers map over and over
> in CompositeParser.getParsers(...) if the context is empty, and to cache the returned value instead.
> This can be done safely even under the assumption that the media-registry and the list of component parsers do change while Tika is executing, by invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of CompositeParser.
> The patch checks for the case where the context is empty, and invalidates the cache in the corresponding setters of both the media-registry and the list of component parsers.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java class library + Tika app + other apps), the patch reduces the running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same order of magnitude are found also for smaller workloads.
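
The caching idea in the description above can be sketched roughly as follows. This is not the attached patch and not Tika's API: {{TypedParser}} and {{CachingComposite}} are hypothetical stand-ins, media types are plain strings, the boolean {{contextIsEmpty}} stands in for inspecting a ParseContext, and {{buildCount}} exists only to make the effect observable.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a component parser; NOT Tika's Parser interface.
interface TypedParser {
    Set<String> supportedTypes();
}

// Hypothetical, simplified composite showing "cache the map, invalidate in setters".
class CachingComposite {
    private List<TypedParser> parsers;
    private Map<String, TypedParser> cachedMap; // null = not computed yet or invalidated
    private int buildCount = 0;                 // instrumentation for this example only

    CachingComposite(List<TypedParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
    }

    // Changing the component parsers makes any cached map stale, so drop it.
    synchronized void setParsers(List<TypedParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        this.cachedMap = null;
    }

    // Only the empty-context case is cached; otherwise the map is rebuilt,
    // because a non-empty context could change which parsers apply.
    synchronized Map<String, TypedParser> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            return buildMap();
        }
        if (cachedMap == null) {
            cachedMap = buildMap();
        }
        return cachedMap;
    }

    // Builds the full type->parser map; later parsers override earlier ones.
    private Map<String, TypedParser> buildMap() {
        buildCount++;
        Map<String, TypedParser> map = new HashMap<>();
        for (TypedParser p : parsers) {
            for (String type : p.supportedTypes()) {
                map.put(type, p);
            }
        }
        return Collections.unmodifiableMap(map);
    }

    synchronized int buildCount() {
        return buildCount;
    }
}
```

A setter for the media registry would clear {{cachedMap}} in the same way; it is left out here to keep the sketch short.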

