tika-dev mailing list archives

From "Eugen Mayer (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2359) Extreme slow parsing on the attachment attached
Date Fri, 12 May 2017 07:04:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16007718#comment-16007718 ]

Eugen Mayer edited comment on TIKA-2359 at 5/12/17 7:03 AM:
------------------------------------------------------------

Guys, as far as I understood, you just explained that you

1. Are not using the easy, deterministic, fast extraction libs by default, even if they are
installed ( as a design decision ) - like exiftool, strings and others
2. But you are using the most expensive, non-deterministic one, OCR, by default.

No matter what this means for legacy users, think about the decision process here - I would
say this needs to be fixed. Not for me - I got this, but believe me, I have been using Tika
for 4 years now and that's the first time I stumbled upon this - I have been living with this
waste of time for 4 years now.

This just fools your user base and even makes you look bad performance-wise - I was comparing
Tika to other doc/pdf-to-text libs which performed better and was about to switch - because
I had no idea I was comparing apples with oranges ( OCR vs plaintext ).

To give you a number, the example document takes 93s with the defaults (so with OCR) and 0.9s
without. We are talking about roughly 100x slower.
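
As a quick check of the "roughly 100x" figure, the ratio of the two reported timings can be
worked out directly. This is only a minimal sketch: the variable names are illustrative, and
the only inputs are the two timings quoted in this comment.

```python
# Timings reported in this comment, in seconds.
with_ocr_s = 93.0     # default configuration (OCR enabled)
without_ocr_s = 0.9   # same document with OCR disabled

# Slowdown factor of the default (OCR) path relative to plain-text extraction.
slowdown = with_ocr_s / without_ocr_s
print(f"OCR path is about {slowdown:.0f}x slower")  # prints: OCR path is about 103x slower
```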




was (Author: eugenmayer):
Guys, as far as I understood, you just explained that you

1. Are not using the easy, deterministic, fast extraction libs by default, even if they are
installed ( as a design decision )
2. But you are using the most expensive, non-deterministic one, OCR, by default.

No matter what this means for legacy issue, think about the decision process here - I would
say this needs to be fixed. Not for me - I got this, but believe me, I have been using Tika
for 4 years now and that's the first time I stumbled upon this - I have been living with this
waste of time for 4 years now.

This just fools your user base and even makes you look bad performance-wise - I was comparing
Tika to other doc/pdf-to-text libs which performed better and was about to switch - because
I had no idea I was comparing apples with oranges ( OCR vs plaintext ).

To give you a number, the example document takes 93s with the defaults (so with OCR) and 0.9s
without. We are talking about roughly 100x slower.



> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14 in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> Debian Jessie, 8GB RAM in a Docker container, current Xeon 3GHz, so decent (limited to 2 cores)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
