tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
Date Sat, 10 Nov 2012 22:21:12 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494783#comment-13494783 ]

Nick Burch commented on TIKA-1020:
----------------------------------

The current Tika behaviour is what I'd expect: you get text for the cells with real
values, and the output isn't cluttered with the missing cells/rows (of which there can be
huge numbers in many Excel files). I'm not sure we want to be putting cell references, blank
cells etc. into the HTML.

If you have specific requirements in this area, e.g. you actually want to generate things
like CSV files, then you're best off using Apache POI directly, which provides optional
ways to detect these missing cells/rows and lets you put in your own logic to handle them
as your needs dictate.
                
> Excel 2010 parser missing cell values are not reported resulting in missing columns values
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1020
>                 URL: https://issues.apache.org/jira/browse/TIKA-1020
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: java 1.6 & 1.7 
>            Reporter: Neil Blue
>              Labels: newbie, patch
>
> When parsing an Excel 2010 table, if a worksheet has a missing value, it is not
> reported to the SAX handler. As a result, a missing value can lead to misaligned data.
> For example, given the table:
> {code:title=Bar.java|borderStyle=solid}
> A B C
> 1 2 3
> 4   6
> 7 8 9
> {code}
> the SAX handler receives the elements
> {code:title=Bar.java|borderStyle=solid}
> <tr><td>A</td><td>B</td><td>C</td></tr>
> <tr><td>1</td><td>2</td><td>3</td></tr>
> <tr><td>4</td><td>6</td></tr>
> <tr><td>7</td><td>8</td><td>9</td></tr>
> {code}
> As a result, the handler can detect that the third row has incomplete cell values, but it
> is ambiguous which columns are missing data.
> As a possible fix, the Excel 2010 XML data contains the cell reference value, which
> could be passed to the SAX handler as an attribute.
> {code:title=Bar.java|borderStyle=solid}
> *** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
> --- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
> ***************
> *** 200,206 ****
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              xhtml.startElement("td");
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> --- 200,208 ----
>   
>          public void cell(String cellRef, String formattedValue) {
>             try {
> !              AttributesImpl attributes = new AttributesImpl();
> !              attributes.addAttribute("", "cellRef", "cellRef", "CDATA", cellRef);
> !              xhtml.startElement("td",attributes);
>   
>                // Main cell contents
>                xhtml.characters(formattedValue);
> {code} 
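
To illustrate how a consumer might use such a cellRef attribute, here is a minimal sketch.
It is not part of the patch above or of Tika, and the class and method names are
hypothetical. It assumes plain A1-style references (e.g. "C3") and a known row width:
the column letters are converted to a zero-based index, and any skipped columns are
filled with empty strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only.
final class CellRefPadding {

    /** Converts the letters of an A1-style reference to a zero-based column index. */
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = Character.toUpperCase(cellRef.charAt(i));
            if (ch < 'A' || ch > 'Z') {
                break; // reached the row number
            }
            // Column letters behave like a base-26 number with digits A=1 .. Z=26.
            col = col * 26 + (ch - 'A' + 1);
        }
        if (col == 0) {
            throw new IllegalArgumentException("No column letters in: " + cellRef);
        }
        return col - 1;
    }

    /** Places each value at its column; columns with no reported cell stay empty. */
    static List<String> padRow(Map<String, String> cellsByRef, int width) {
        List<String> row = new ArrayList<>();
        for (int i = 0; i < width; i++) {
            row.add("");
        }
        for (Map.Entry<String, String> cell : cellsByRef.entrySet()) {
            int col = columnIndex(cell.getKey());
            if (col < width) {
                row.set(col, cell.getValue());
            }
        }
        return row;
    }
}
```

For the third row of the example table, passing the cells A3=4 and C3=6 with a width of 3
to padRow yields the values "4", "" and "6", so the gap in column B is explicit.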

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
