tika-dev mailing list archives

From "Ashish Basran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2180) Multiple requests on Tika to extract text slows down
Date Mon, 28 Nov 2016 19:44:58 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702912#comment-15702912 ]

Ashish Basran commented on TIKA-2180:

Thanks Tim. I tried 4 concurrent requests on a 4-CPU machine. The timings are below,
followed by the timings when I run the same 4 requests in sequence:

(Document Name)    (processing time in seconds)
sample (2).pdf     5.3199278
sample (3).pdf     5.3264681
sample (1).pdf     5.351233
sample (1).docx    39.1171855

(Document Name)    (processing time in seconds)
sample (1).docx    40.1103043
sample (1).pdf     3.4400907
sample (2).pdf     0.4291182
sample (3).pdf     0.3833339
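
As a rough reading of the two tables above, the sketch below totals the timings. It assumes the first table is the concurrent run and the second the sequential run, and that all four concurrent requests started at the same moment, so the concurrent wall-clock time is approximated by the slowest request. The variable names are illustrative only.

```python
# Timings copied from the two tables (seconds).
# Assumption: first table = concurrent run, second table = sequential run.
concurrent = {
    "sample (2).pdf": 5.3199278,
    "sample (3).pdf": 5.3264681,
    "sample (1).pdf": 5.351233,
    "sample (1).docx": 39.1171855,
}
sequential = {
    "sample (1).docx": 40.1103043,
    "sample (1).pdf": 3.4400907,
    "sample (2).pdf": 0.4291182,
    "sample (3).pdf": 0.3833339,
}

# Sequential wall-clock time is the sum of the individual requests.
sequential_wall = sum(sequential.values())

# Assumption: concurrent requests all start together, so the batch
# finishes when the slowest request finishes.
concurrent_wall = max(concurrent.values())

# Per-document slowdown when the request runs alongside the others.
slowdown = {name: concurrent[name] / sequential[name] for name in sequential}

print(f"sequential wall time ~ {sequential_wall:.2f} s")
print(f"concurrent wall time ~ {concurrent_wall:.2f} s")
for name, factor in sorted(slowdown.items()):
    print(f"{name}: {factor:.1f}x")
```

Under these assumptions the whole batch is dominated by the single .docx in both runs, while each PDF takes longer when it runs alongside the other requests.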

I have hundreds of documents to process and am looking to minimize the processing time.

> Multiple requests on Tika to extract text slows down
> ----------------------------------------------------
>                 Key: TIKA-2180
>                 URL: https://issues.apache.org/jira/browse/TIKA-2180
>             Project: Tika
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 1.13, 1.14
>         Environment: Windows OS, Open JDK, 4 core 32 GB RAM
>            Reporter: Ashish Basran
> I observed that if I send multiple requests to Tika (e.g. http://localhost:8080/tika)
> with files of around 5 MB, Tika is very slow to complete them. I tried with ~20 random
> files: processing them all in sequence took 170 seconds. When I sent all the files in parallel,
> the same set took around 780 seconds.
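
One way to reproduce the comparison in the report (170 seconds in sequence versus around 780 seconds in parallel, roughly 4.6 times slower) is a small timing harness that caps the number of requests in flight. The following is a minimal sketch, not the reporter's actual client: it assumes the server accepts a PUT of the raw file at the URL from the report and returns plain text, and the function names `extract_text`, `timed`, and `run_batch` are hypothetical.

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# URL taken from the report; adjust host and port for your own server.
TIKA_URL = "http://localhost:8080/tika"


def extract_text(path, url=TIKA_URL):
    """PUT one file to the server and return the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        url, data=data, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def timed(send, path):
    """Call send(path) and return (path, elapsed seconds)."""
    start = time.perf_counter()
    send(path)
    return path, time.perf_counter() - start


def run_batch(paths, workers, send=extract_text):
    """Process paths with at most `workers` requests in flight.

    Returns (wall_clock_seconds, {path: seconds_for_that_request}).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: timed(send, p), paths))
    wall = time.perf_counter() - start
    return wall, dict(results)
```

Calling `run_batch(paths, workers=1)` gives the sequential baseline, and repeating the call with `workers=2`, `workers=4`, and so on shows at which level of concurrency the total time stops improving on a given machine.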

This message was sent by Atlassian JIRA