tika-dev mailing list archives

From "Luca Della Toffola (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1149) Improve parser lookup performance
Date Tue, 13 Aug 2013 12:27:49 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738130#comment-13738130 ]

Luca Della Toffola edited comment on TIKA-1149 at 8/13/13 12:27 PM:
--------------------------------------------------------------------

I did a quick test with the new patch. By letting {{CompositeParser}} inherit from {{SimpleParser}}
and commenting out the current {{CompositeParser.getSupportedTypes(ParseContext)}} method, I obtain
a ~5% speedup. I used the same workload as before and ran Tika with {{-d --text}}, redirecting
the output to {{/dev/null}}. As expected, not all test cases pass in my case either.
                
      was (Author: ldellatoffola):
    I did a quick test with the new patch. By letting {{CompositeParser}} inherit from {{SimpleParser}}
and commenting out the current {{CompositeParser.getSupportedTypes(ParseContext)}} method, I obtain
a ~5% speedup. I used the same workload as before and ran Tika with {{-d --text}}. As expected,
not all test cases pass in my case either.
                  
> Improve parser lookup performance
> ---------------------------------
>
>                 Key: TIKA-1149
>                 URL: https://issues.apache.org/jira/browse/TIKA-1149
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.3, 1.4
>            Reporter: Luca Della Toffola
>            Priority: Minor
>              Labels: performance
>         Attachments: 0001-TIKA-1149-Improve-parser-lookup-performance.patch, CompositeParser.patch, ParseContext.patch
>
>
> We found an easy way to improve Tika's performance. The idea is to avoid recomputing the parsers map over and over
> in CompositeParser.getParsers(...) if the context is empty, and to cache the returned value instead.
> This can be done safely even under the assumption that the media-registry and the list of component parsers do change while Tika is executing, by invalidating the cache in that case.
> Our attached patch computes the parsers map once per instance of CompositeParser.
> The patch checks for the case where the context is empty, and the corresponding setters invalidate the cache when the media-registry or the list of component parsers changes.
> For example, when running Tika 1.3 on a set of large (~50k classes) JAR files (i.e., Java class library + Tika app + other apps), the patch reduces the running time
> from 32 seconds to 29 seconds -- i.e., a speedup of ~12%. Speedups of the same order of magnitude are also found for smaller workloads.
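
The caching idea in the description above can be sketched roughly as follows. This is not the attached patch and does not use Tika's classes: {{ComponentParser}}, {{CachingComposite}}, the {{aliases}} map (a stand-in for the media type registry), the {{contextIsEmpty}} flag and the {{rebuilds}} counter are all hypothetical names chosen for illustration. The sketch only shows the shape of the technique: build the map lazily, reuse it while the context is empty, and drop it in the setters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a component parser; not a Tika type. */
interface ComponentParser {
    List<String> supportedTypes();
}

/** Sketch of a composite that caches its type-to-parser map. */
class CachingComposite {
    private List<ComponentParser> parsers = new ArrayList<>();
    // Assumption: a plain alias map stands in for the media type registry.
    private Map<String, String> aliases = new HashMap<>();
    // null means "not built yet" or "invalidated by a setter".
    private Map<String, ComponentParser> cache;
    // Counts map builds; exists only to make the caching observable.
    int rebuilds = 0;

    synchronized void setParsers(List<ComponentParser> parsers) {
        this.parsers = new ArrayList<>(parsers);
        cache = null; // the component list changed: invalidate
    }

    synchronized void setAliases(Map<String, String> aliases) {
        this.aliases = new HashMap<>(aliases);
        cache = null; // the registry stand-in changed: invalidate
    }

    synchronized Map<String, ComponentParser> getParsers(boolean contextIsEmpty) {
        if (!contextIsEmpty) {
            // A non-empty context could change the result, so never cache it.
            return build();
        }
        if (cache == null) {
            cache = build();
        }
        return cache;
    }

    private Map<String, ComponentParser> build() {
        rebuilds++;
        Map<String, ComponentParser> map = new HashMap<>();
        for (ComponentParser p : parsers) {
            for (String type : p.supportedTypes()) {
                // Resolve the type through the alias map, falling back to itself.
                map.put(aliases.getOrDefault(type, type), p);
            }
        }
        return Collections.unmodifiableMap(map);
    }
}
```

Returning an unmodifiable map matters here because the same instance is handed to every caller; the {{synchronized}} methods are one simple way to keep invalidation and rebuilding consistent across threads, at the cost of some contention.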

