tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2359) Extreme slow parsing on the attachment attached
Date Fri, 12 May 2017 19:56:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008628#comment-16008628 ]

Chris A. Mattmann commented on TIKA-2359:
-----------------------------------------

This is a tough one. In general I'd be fine with adding a boolean parameter to the Tesseract
config, org.apache.tika.parser.ocr.tesseract.enable (default "false"). That said, doing so
would break things for users who, since TIKA-93, expect that if they install Tesseract, Tika
picks it up and uses it. So it would be a very backwards-incompatible change, because we would
now require users to install a config file or update their Java system properties or Tika
config parameters, which isn't nice at all. Part of the convenience of Tika "picking up"
Tesseract is that it is zero config, zero maintenance.

Any change to this needs careful thought, documentation updates on the wiki and in CHANGES.txt,
and convenience scripts, etc., that make the one-time upgrade, and using OCR with Tika going
forward, extremely painless. I am one of the users who rely on OCR being enabled by default
whenever Tesseract is installed.

Consider the opposite: would it be so hard to simply add a property to turn it on/off, have
it on by default, and allow it to be disabled with, e.g.,
java -Dorg.apache.tika.parser.ocr.tesseract=false?
To me that's easier, handles back compat better, and is less intrusive.
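
A minimal sketch of the on-by-default toggle suggested above, assuming the property is read
as a plain Java system property. The class OcrToggle and the method isOcrEnabled are
hypothetical names used only for illustration; they are not part of Tika, and only the
property name comes from the example in this comment.

```java
import java.util.Properties;

/** Hypothetical helper (not a Tika class): OCR stays on unless explicitly disabled. */
public final class OcrToggle {

    // Property name taken from the example in the comment above.
    static final String PROPERTY = "org.apache.tika.parser.ocr.tesseract";

    private OcrToggle() {}

    /**
     * Returns true unless the property is explicitly set to "false".
     * An absent property means "enabled", which preserves the zero-config behaviour.
     */
    static boolean isOcrEnabled(Properties props) {
        String value = props.getProperty(PROPERTY);
        if (value == null) {
            return true; // not set: keep the existing default
        }
        // Only an explicit "false" (case-insensitive) disables OCR.
        return !"false".equalsIgnoreCase(value.trim());
    }

    public static void main(String[] args) {
        System.out.println("OCR enabled: " + isOcrEnabled(System.getProperties()));
    }
}
```

Treating only an explicit "false" as "off" means a missing or mistyped value leaves OCR
enabled, which is the backwards-compatible choice argued for here.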

My 2c.

> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> Parsing this document takes 93s using 1.14, in server or in CLI mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 cores limited)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
