tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2359) Extreme slow parsing on the attachment attached
Date Fri, 12 May 2017 22:02:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008782#comment-16008782 ]

Chris A. Mattmann commented on TIKA-2359:
-----------------------------------------

Hi [~lfcnassif], great points.

Your point here:
bq. I think it is more likely they will note the breaking change and search for the option
to get ocr back than a new user of Tika searching for an option to get performance speed up
or to disable some ocr that they do not know about.

I am not so sure about that. In fact, the data tells me the opposite. We haven't had hundreds
of JIRAs filed by users who find Tika to be slow; quite the contrary. And OCR has been on for
quite a few releases now (not strictly "by default", but whenever Tesseract is installed,
whether the user knows it or not).

I'm happy to have a waiting period to consider this. I also think it's just as easy either
way: a system property can either enable or disable OCR. Since OCR has been "enabled"
whenever Tesseract is installed (big "if"), and that has been the expectation, I would say
we ought to stay with that behavior and help the handful of users who have raised performance
as an issue in tickets like this one by having that minority set the option as a command-line
parameter. As you say, I would be a big +1 either way on having logging say something like
"OCR is on, did you really want that?".
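
To make the idea concrete, here is a rough sketch of what "OCR stays on when Tesseract is
installed, a system property turns it off, and a log line warns about it" could look like.
This is not Tika code: the class name, the property name tika.ocr.enabled, and the
tesseractInstalled flag are all hypothetical and only illustrate the shape of the option.

```java
import java.util.logging.Logger;

public class OcrToggleSketch {
    // Hypothetical property name for illustration; not an actual Tika setting.
    static final String OCR_PROPERTY = "tika.ocr.enabled";

    private static final Logger LOG = Logger.getLogger(OcrToggleSketch.class.getName());

    /**
     * Decides whether OCR should run.
     * OCR can only run if Tesseract is installed; it stays on unless the
     * (hypothetical) system property is explicitly set to a non-"true" value.
     */
    static boolean ocrEnabled(boolean tesseractInstalled) {
        if (!tesseractInstalled) {
            return false;
        }
        String value = System.getProperty(OCR_PROPERTY);
        // Unset property keeps the existing behavior: OCR on when Tesseract is present.
        boolean enabled = value == null || Boolean.parseBoolean(value);
        if (enabled) {
            LOG.warning("OCR is on because Tesseract was found. Did you really want that? "
                    + "Set -D" + OCR_PROPERTY + "=false to disable it.");
        }
        return enabled;
    }

    public static void main(String[] args) {
        // The caller is assumed to have detected Tesseract some other way.
        boolean tesseractInstalled = args.length > 0 && Boolean.parseBoolean(args[0]);
        System.out.println("OCR enabled: " + ocrEnabled(tesseractInstalled));
    }
}
```

With something along these lines, users who hit slow parsing would start the JVM with
-Dtika.ocr.enabled=false, while everyone else keeps the behavior they already expect.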


> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s with 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 cores limited)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
