tika-dev mailing list archives

From Georger Araújo (JIRA) <j...@apache.org>
Subject [jira] Commented: (TIKA-189) Text extraction from Excel files juxtaposes cells
Date Fri, 23 Jan 2009 11:57:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666503#action_12666503 ]

Georger Araújo commented on TIKA-189:
-------------------------------------

Hi Uwe,
I tried my hand at POI itself and came up with a patch: https://issues.apache.org/bugzilla/show_bug.cgi?id=46544
The output POI gives me now is just what I want. Still, I'd rather use Tika, because then
I can have a single jar and command line, and avoid messing with my CLASSPATH.
I hope this is helpful in some way. Also, as Tika is already using POI 3.5 beta 4, I second
Kumar and ask whether Tika can support Office 2007 files natively.

> Text extraction from Excel files juxtaposes cells
> -------------------------------------------------
>
>                 Key: TIKA-189
>                 URL: https://issues.apache.org/jira/browse/TIKA-189
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 0.3
>         Environment: Tika revision is svn-20090116, platform is Windows XP Pro SP3, JDK version is 1.6.0_06.
>            Reporter: Georger Araújo
>            Priority: Minor
>         Attachments: no_cell_separators_when_extracted.zip, TIKA-189.patch
>
>
> I plan on using Tika to extract text from Excel (both .xls and .xlsx) files for indexing.
> However, I found that Tika juxtaposes cells on output. The example worksheets are in the attached
> .zip file.
> I took the time to run Apache POI and it does not have this bug, i.e. cells are properly
> separated.
> When I run
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xls
> --end--
> I get the following output:
> --begin--
> Plan1
>     NameEmailSanta Claussanta@claus.org
>     Tooth Fairytooth@fairy.org
> --end--
> Same thing with a .xlsx file:
> --begin--
> java -jar tika-0.3-SNAPSHOT-standalone.jar --text no_cell_separators_when_extracted.xlsx
> --end--
> The output is:
> --begin--
> [Content_Types].xml
> _rels/.rels
> xl/_rels/workbook.xml.rels
> xl/workbook.xml
> xl/theme/theme1.xml
> xl/worksheets/_rels/sheet1.xml.rels
> xl/worksheets/sheet2.xml
> xl/worksheets/sheet3.xml
> xl/sharedStrings.xml
> NameEmailSanta Claussanta@claus.orgTooth Fairytooth@fairy.org
> xl/styles.xml
> xl/worksheets/sheet1.xml
> 012345
> docProps/core.xml
> GeorgerGeorger2009-01-17T15:29:04Z2009-01-17T15:30:56Z
> docProps/app.xml
> Microsoft Excel0falsePlanilhas3Plan1Plan2Plan3falsefalsefalse12.0000
> --end--
> Also note that the values from docProps/app.xml have been juxtaposed as well.
> This way, after indexing these files using the output from Tika, a search engine will
> only find "Fairy" when substring matching is used, because "Tooth Fairy" becomes "Tooth Fairytooth@fairy.org".
> This is suboptimal and wrong.
> Thanks for your attention. Best regards,
> Georger
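
The indexing problem described in the report can be illustrated with a minimal Java sketch. It uses neither Tika nor POI; the class name CellSeparatorDemo, the helper methods, the whitespace tokenizer, and the tab separator are all assumptions made for illustration, not the behaviour of either library or of the attached patch.

```java
import java.util.Arrays;
import java.util.List;

public class CellSeparatorDemo {

    // Joins the cell values of one row using the given separator
    // (hypothetical helper; an empty separator mimics the reported output).
    public static String joinCells(List<String> cells, String separator) {
        return String.join(separator, cells);
    }

    // Splits extracted text on whitespace, roughly as a simple
    // token-based indexer might (an assumption, not a real analyzer).
    public static List<String> tokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // One row taken from the example worksheet in the report.
        List<String> row = Arrays.asList("Tooth Fairy", "tooth@fairy.org");

        String juxtaposed = joinCells(row, "");   // cells run together
        String separated = joinCells(row, "\t");  // assumed tab separator

        System.out.println(tokens(juxtaposed));
        System.out.println(tokens(separated));

        // An exact-token lookup for "Fairy" only succeeds when cells are separated.
        System.out.println(tokens(juxtaposed).contains("Fairy"));
        System.out.println(tokens(separated).contains("Fairy"));
    }
}
```

With an empty separator the second token becomes "Fairytooth@fairy.org", so an exact match on "Fairy" fails; any whitespace separator between cells avoids this.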

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

