Date: Fri, 12 May 2017 22:03:04 +0000 (UTC)
From: "Chris A. Mattmann (JIRA)"
To: dev@tika.apache.org
Subject: [jira] [Comment Edited] (TIKA-2359) Extreme slow parsing on the attachment attached

    [ https://issues.apache.org/jira/browse/TIKA-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16008782#comment-16008782 ]

Chris A. Mattmann edited comment on TIKA-2359 at 5/12/17 10:02 PM:
-------------------------------------------------------------------

Hi [~lfcnassif] great points. Your point here:

bq. I think it is more likely they will note the breaking change and search for the option to get ocr back than a new user of Tika searching for an option to get performance speed up or to disable some ocr that they do not know about.

I am not so sure about. In fact, the data tells me the opposite. We haven't had hundreds of JIRAs filed by users who find Tika to be slow. In fact, quite the opposite, and OCR has been on (if Tesseract is installed - so it's not "by default", but if you have Tesseract installed, either knowingly or unknowingly) for quite a few releases now.

I'm happy to have a waiting period to consider this. I also say I think it's just as easy either way - that is to set a system property to either enable, or disable OCR.
For me, since it's been "enabled" if Tesseract is installed (big "if") and that's been the expectation, I would say that we ought to stay with that, and then help the handful of users that have suggested performance is an issue in tickets like this by making that minority set the option as a command line parameter.

I would be a big +1 as you say either way to have logging say "OCR is on, did you really want that?" or something like that.


was (Author: chrismattmann):
Hi [~lfcnassif] great points. Your point here:

bq. I think it is more likely they will note the breaking change and search for the option to get ocr back than a new user of Tika searching for an option to get performance speed up or to disable some ocr that they do not know about.

I am not so sure about. In fact, the data tells me the opposite. We haven't had hundreds of JIRAs filed by users who find Tika to be slow. In fact, quite the opposite, and OCR has been on (if Tesseract is installed - so it's not "by default", but if you have Tesseract installed, either known or unknown) for quite a few releases now.

I'm happy to have a waiting period to consider this. I also say I think it's just as easy either way - that is to set a system property to either enable, or disable OCR.

For me, since it's been "enabled" if Tesseract is installed (big "if") and that's been the expectation, I would say that we ought to stay with that, and then help the handful of users that have suggested performance is an issue in tickets like this by making that minority set the option as a command line parameter.

I would be a big +1 as you say either way to have logging say "OCR is on, did you really want that?" or something like that.
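
For illustration only, the opt-out idea discussed in this comment (OCR stays on when Tesseract is present, a system property turns it off, and a log line warns that OCR is active) could be sketched roughly as below. The property name `example.ocr.enabled`, the class name and the `shouldRunOcr` helper are all hypothetical; none of them is an actual Tika setting or API, and Tesseract detection is passed in as a plain boolean rather than implemented.

```java
import java.util.logging.Logger;

public class OcrToggleSketch {
    // Hypothetical property name, used for illustration only; not a real Tika option.
    public static final String OCR_PROPERTY = "example.ocr.enabled";

    private static final Logger LOG = Logger.getLogger(OcrToggleSketch.class.getName());

    /**
     * Returns true when OCR should run: Tesseract must be available and the
     * (hypothetical) system property must not have been set to a non-"true" value.
     */
    public static boolean shouldRunOcr(boolean tesseractInstalled) {
        if (!tesseractInstalled) {
            return false;
        }
        // Defaulting to "true" keeps the behaviour described in the thread:
        // OCR is on whenever Tesseract is installed, unless the user opts out.
        boolean enabled = Boolean.parseBoolean(System.getProperty(OCR_PROPERTY, "true"));
        if (enabled) {
            LOG.warning("OCR is on, did you really want that? Set -D"
                    + OCR_PROPERTY + "=false to disable it.");
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Assume Tesseract was detected; real detection is out of scope for this sketch.
        System.out.println("OCR enabled: " + shouldRunOcr(true));
    }
}
```

Under these assumptions, a user who finds parsing slow would opt out on the command line with something like `java -Dexample.ocr.enabled=false OcrToggleSketch`, while everyone else keeps the existing behaviour and sees the warning in the log.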

> Extreme slow parsing on the attachment attached
> -----------------------------------------------
>
>                 Key: TIKA-2359
>                 URL: https://issues.apache.org/jira/browse/TIKA-2359
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Eugen Mayer
>         Attachments: Sample-doc-file-2000kb.doc
>
>
> i have 93s for parsing this document using 1.14 in server or in cli mode.
> Java:
> java version "1.8.0_121"
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
> debian-jessie, 8GB ram in a docker container, current xeon 3GHz, so decent (2 cores limited)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)