tika-dev mailing list archives

From "Neil Blue (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1020) Excel 2010 parser missing cell values are not reported resulting in missing columns values
Date Thu, 08 Nov 2012 11:25:11 GMT
Neil Blue created TIKA-1020:
-------------------------------

             Summary: Excel 2010 parser missing cell values are not reported resulting in
missing columns values
                 Key: TIKA-1020
                 URL: https://issues.apache.org/jira/browse/TIKA-1020
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.2
         Environment: java 1.6 & 1.7 
            Reporter: Neil Blue


When parsing an Excel 2010 table, if a worksheet has a missing value, then it is not reported
to the SAX handler. As a result, a missing value can leave the remaining values in a row
out of line with their columns.

For example given the table:

{code:title=Bar.java|borderStyle=solid}
A B C
1 2 3
4   6
7 8 9
{code}

the returned sax handler reports elements

{code:title=Bar.java|borderStyle=solid}
<tr><td>A</td><td>B</td><td>C</td></tr>
<tr><td>1</td><td>2</td><td>3</td></tr>
<tr><td>4</td><td>6</td></tr>
<tr><td>7</td><td>8</td><td>9</td></tr>
{code}

As a result the handler can detect that the third row has incomplete cell values, but it is
ambiguous which columns have missing data.
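
To illustrate the ambiguity, here is a minimal sketch. It is not Tika code: the class and method names are made up, and renderRow only mimics the reported behaviour of skipping empty cells. Two different worksheet rows end up producing the same markup:

```java
import java.util.Arrays;
import java.util.List;

public class MissingCellAmbiguity {

    // Hypothetical stand-in for the reported behaviour: a null (empty)
    // cell is skipped entirely instead of being emitted as an empty <td>.
    static String renderRow(List<String> cells) {
        StringBuilder sb = new StringBuilder("<tr>");
        for (String cell : cells) {
            if (cell != null) {
                sb.append("<td>").append(cell).append("</td>");
            }
        }
        return sb.append("</tr>").toString();
    }

    public static void main(String[] args) {
        String middleMissing = renderRow(Arrays.asList("4", null, "6"));
        String lastMissing = renderRow(Arrays.asList("4", "6", null));
        System.out.println(middleMissing);
        // Both rows serialise identically, so the column of "6" is lost.
        System.out.println(middleMissing.equals(lastMissing));
    }
}
```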

As a possible fix, the Excel 2010 XML data contains the cell reference value, which could
be passed to the SAX handler as an attribute.

{code:title=Bar.java|borderStyle=solid}
*** XSSFExcelExtractorDecorator.java    2012-11-08 10:51:55.881207100 +0000
--- XSSFExcelExtractorDecorator.java.1  2012-11-08 10:59:02.972223700 +0000
***************
*** 200,206 ****
  
         public void cell(String cellRef, String formattedValue) {
            try {
!              xhtml.startElement("td");
  
               // Main cell contents
               xhtml.characters(formattedValue);
--- 200,208 ----
  
         public void cell(String cellRef, String formattedValue) {
            try {
!              AttributesImpl attributes = new AttributesImpl();
!              attributes.addAttribute("", "cellRef", "cellRef", "CDATA", cellRef);
!              xhtml.startElement("td",attributes);
  
               // Main cell contents
               xhtml.characters(formattedValue);


{code} 
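
With an attribute like this in place, client code could restore the column positions itself. The following is only a sketch of how that might look, assuming the attribute is named "cellRef" as in the patch above and carries an A1-style reference such as "C3". CellRefRowCollector and columnIndex are hypothetical names, not part of Tika or POI.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class CellRefRowCollector extends DefaultHandler {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;
    private StringBuilder text;
    private int column;

    // Converts an A1-style reference ("C3", "AA10") to a zero-based column
    // index. Returns -1 if the reference does not start with a letter.
    static int columnIndex(String cellRef) {
        int col = 0;
        for (int i = 0; i < cellRef.length(); i++) {
            char ch = Character.toUpperCase(cellRef.charAt(i));
            if (ch < 'A' || ch > 'Z') {
                break; // reached the row digits
            }
            col = col * 26 + (ch - 'A' + 1);
        }
        return col - 1;
    }

    // Prefer the local name; fall back to the qualified name if it is empty.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String element = name(localName, qName);
        if ("tr".equals(element)) {
            currentRow = new ArrayList<>();
        } else if ("td".equals(element) && currentRow != null) {
            // Assumption: the attribute name matches the proposed patch.
            String ref = atts.getValue("cellRef");
            column = (ref != null) ? columnIndex(ref) : currentRow.size();
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String element = name(localName, qName);
        if ("td".equals(element) && text != null) {
            // Pad any columns that were skipped before this cell.
            while (currentRow.size() < column) {
                currentRow.add("");
            }
            currentRow.add(text.toString());
            text = null;
        } else if ("tr".equals(element) && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    public List<List<String>> getRows() {
        return rows;
    }
}
```

Note that this only fills gaps before a reported cell. Missing cells at the end of a row would still need to be padded against the header width by the caller.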

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
