tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2359) Extreme slow parsing on the attachment attached
Date Fri, 12 May 2017 21:57:04 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008774#comment-16008774 ]

Luis Filipe Nassif commented on TIKA-2359:

Hi Cris, thank you!

I think this issue shows that many users have OCR installed on their systems for reasons
unrelated to Tika, and they will get a 100x slowdown without knowing why. So the original
hypothesis put forward in TIKA-93, that Tesseract is uncommon and, if present, was installed
for Tika, is wrong. New users (and some old ones!) may not know they have to set a Java
system property to get a 100x speedup.

For users who need OCR, setting a Java system property should be just as simple. Of course,
this is a breaking change that must be documented everywhere: on the wiki, in the release
notes, in a site announcement, and even in the logs. As for users who miss all those warnings,
I think they are more likely to notice the breaking change and search for the option to get
OCR back than a new Tika user is to search for an option to speed up parsing or to disable
an OCR step they do not know about.

So I propose that 1.15 add a log message saying "OCR is on and can cause severe slowdowns,
and it will be disabled by default in 1.16". That way users will have more time to learn
about the change.
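
To make the proposal above concrete, here is a minimal sketch of how the 1.15 transition
could behave. It is not Tika code: the class name, the property name "example.ocr.enabled",
and the exact wording of the hint are assumptions made up for illustration. The idea is that
an explicit property value always wins, and the warning is logged only for users who never
made a choice and are therefore running OCR by default.

```java
import java.util.Optional;
import java.util.logging.Logger;

// Illustrative sketch only; none of these names exist in Tika.
final class OcrDefaultSketch {
    // Hypothetical system property name, chosen for this example.
    static final String OCR_PROPERTY = "example.ocr.enabled";
    // Default for the proposed 1.15 transition release: OCR is still on.
    static final boolean DEFAULT_ENABLED = true;

    private static final Logger LOG = Logger.getLogger(OcrDefaultSketch.class.getName());

    /** An explicit property value wins; otherwise the release default applies. */
    static boolean ocrEnabled(String propertyValue) {
        if (propertyValue == null) {
            return DEFAULT_ENABLED;
        }
        // Boolean.parseBoolean treats anything other than "true" (any case) as false.
        return Boolean.parseBoolean(propertyValue.trim());
    }

    /** Warn only users who never set the property and are running OCR by default. */
    static Optional<String> startupWarning(String propertyValue) {
        if (propertyValue == null && DEFAULT_ENABLED) {
            return Optional.of("OCR is on and can cause severe slowdowns; "
                    + "it will be disabled by default in 1.16. Set -D" + OCR_PROPERTY
                    + "=true to keep it, or =false to turn it off now.");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String value = System.getProperty(OCR_PROPERTY);
        startupWarning(value).ifPresent(LOG::warning);
        System.out.println("OCR enabled: " + ocrEnabled(value));
    }
}
```

With this shape, flipping DEFAULT_ENABLED to false in 1.16 changes the behavior only for
users who never set the property, and those are exactly the users who saw the warning in 1.15.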

> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 cores limited)
