From dev-return-3718-apmail-tika-dev-archive=tika.apache.org@tika.apache.org Tue Jul 20 02:58:45 2010
Return-Path:
Delivered-To: apmail-tika-dev-archive@www.apache.org
Received: (qmail 46610 invoked from network); 20 Jul 2010 02:58:45 -0000
Received: from unknown (HELO mail.apache.org) (140.211.11.3)
  by 140.211.11.9 with SMTP; 20 Jul 2010 02:58:45 -0000
Received: (qmail 13770 invoked by uid 500); 20 Jul 2010 02:58:45 -0000
Delivered-To: apmail-tika-dev-archive@tika.apache.org
Received: (qmail 13637 invoked by uid 500); 20 Jul 2010 02:58:44 -0000
Mailing-List: contact dev-help@tika.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@tika.apache.org
Delivered-To: mailing list dev@tika.apache.org
Received: (qmail 13628 invoked by uid 99); 20 Jul 2010 02:58:43 -0000
Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230)
  by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Jul 2010 02:58:43 +0000
X-ASF-Spam-Status: No, hits=-2000.0 required=10.0 tests=ALL_TRUSTED
X-Spam-Check-By: apache.org
Received: from [140.211.11.22] (HELO thor.apache.org) (140.211.11.22)
  by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 20 Jul 2010 02:58:41 +0000
Received: from thor (localhost [127.0.0.1])
  by thor.apache.org (8.13.8+Sun/8.13.8) with ESMTP id o6K2ooIG015292
  for ; Tue, 20 Jul 2010 02:50:50 GMT
Message-ID: <19182638.469031279594250283.JavaMail.jira@thor>
Date: Mon, 19 Jul 2010 22:50:50 -0400 (EDT)
From: "Chris A. Mattmann (JIRA)"
To: dev@tika.apache.org
Subject: [jira] Updated: (TIKA-405) Problems handling Hyperlinks and Tables in Word 97 Docs
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394
X-Virus-Checked: Checked by ClamAV on apache.org

     [ https://issues.apache.org/jira/browse/TIKA-405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-405:
-----------------------------------

    Component/s: parser

- classify the component

> Problems handling Hyperlinks and Tables in Word 97 Docs
> -------------------------------------------------------
>
>                 Key: TIKA-405
>                 URL: https://issues.apache.org/jira/browse/TIKA-405
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>         Environment: 32-bit Ubuntu Linux
>            Reporter: Curtis Warner
>         Attachments: actual.txt, expected.txt, WordDocWithLinksAndTable.doc
>
>
> I discovered some odd behavior while running a three-way comparison test between Tika, Aperture, and Autonomy KeyView. The input file was a test Word 97 Doc (attached) including a paragraph peppered with hyperlinks and a table filled with dummy text. KeyView generated the full text, as I expected. Aperture and Tika had identical results to one another (barring one lost whitespace character), but their outputs yielded significantly fewer tokens than KeyView's did. I've attached the output text from KeyView and Tika for reference.
> There are two distinct problems I recognized in Tika's text output:
> 1) Hyperlinks from the Word Doc aren't included in the output text. They appear to have been skipped completely.
> 2) The values in the Word Doc's table are conglomerated all together into a single blob rather than being emitted separately, which ruins any attempt at tokenizing the table's contents.
> Seeing as both Tika and Aperture had exactly the same issues with this test file, my guess is that it's a problem with the shared POI library. I thought it would be worth noting, though, in case there's an easy fix on the Tika end of things.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.