tika-dev mailing list archives

From "Sam H (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-1947) IllegalArgumentException stacktrace in output since POI update
Date Mon, 11 Apr 2016 14:00:31 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sam H updated TIKA-1947:
------------------------
    Description: 
I tried parsing an Excel document, and noticed there was an IllegalArgumentException stacktrace
in the output.

I've traced this back to https://github.com/apache/tika/commit/25cee54499126de2b90f6bd5bde8de470b422349

Attached you can find my testfile: iae.xlsx

This is the output when running 1.13-SNAPSHOT as a jar:
{code}
java -jar tika-app-1.13-SNAPSHOT.jar iae.xlsx


apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * \(#,##0.00\);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * \(#,##0.00\)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * "-"??_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * "-"??_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-04-11T13:45:08Z"/>
<meta name="extended-properties:AppVersion" content="15.0300"/>
<meta name="dc:creator" content="nick"/>
<meta name="extended-properties:Company" content=""/>
<meta name="dcterms:created" content="2016-01-05T14:53:37Z"/>
<meta name="Last-Modified" content="2016-04-11T13:45:08Z"/>
<meta name="dcterms:modified" content="2016-04-11T13:45:08Z"/>
<meta name="Last-Save-Date" content="2016-04-11T13:45:08Z"/>
<meta name="protected" content="false"/>
<meta name="meta:save-date" content="2016-04-11T13:45:08Z"/>
<meta name="Application-Name" content="Microsoft Excel"/>
<meta name="modified" content="2016-04-11T13:45:08Z"/>
<meta name="Content-Length" content="9119"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="nick"/>
<meta name="meta:author" content="nick"/>
<meta name="meta:creation-date" content="2016-01-05T14:53:37Z"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="meta:last-author" content="Sam"/>
<meta name="Creation-Date" content="2016-01-05T14:53:37Z"/>
<meta name="resourceName" content="iae.xlsx"/>
<meta name="Last-Author" content="Sam"/>
<meta name="Application-Version" content="15.0300"/>
<meta name="Author" content="nick"/>
<meta name="publisher" content=""/>
<meta name="dc:publisher" content=""/>
<title/>
</head>
<body><div><h1>Sheet1</h1>
<table><tbody><tr>      <td>69.99</td></tr>
</tbody></table>
</div>
</body></html>
{code}

The real output is consistent with what I would expect (and with the output from version 1.12).

I would expect this exception to be handled some other way, rather than showing up (as text) in my parsed output.
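
A possible interim workaround, sketched under assumptions: the header line of each warning above has the shape that java.util.logging's default formatter produces, so POI appears to emit these records through java.util.logging. Assuming the logger involved is named after a class in the org.apache.poi.ss.format package (not confirmed here), raising the level of that package logger before parsing should drop the WARNING records and their stack traces. The class and method names below are hypothetical, and the child logger name in main is only a stand-in.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

final class QuietPoiFormatLogging {
    // Keep a strong reference: java.util.logging holds loggers only weakly,
    // so an unreferenced logger could be collected and lose its level.
    // The package name is an assumption taken from the stack trace above.
    private static final Logger POI_FORMAT_LOGGER =
            Logger.getLogger("org.apache.poi.ss.format");

    /** Raises the threshold so WARNING records from this package are dropped. */
    static Logger silenceFormatWarnings() {
        POI_FORMAT_LOGGER.setLevel(Level.SEVERE);
        return POI_FORMAT_LOGGER;
    }

    public static void main(String[] args) {
        silenceFormatWarnings();
        // A logger created below the package inherits the package level.
        // "CellFormatter" is a stand-in name used only for this demonstration.
        Logger child = Logger.getLogger("org.apache.poi.ss.format.CellFormatter");
        System.out.println(child.isLoggable(Level.WARNING)); // prints false
    }
}
```

Calling silenceFormatWarnings() once before handing the document to the parser would be enough in this sketch. It only hides the log records; the unsupported format block itself is still rejected inside POI.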

  was:
I tried parsing an Excel document, and noticed there was an IllegalArgumentException stacktrace
in the output.

I've traced this back to https://github.com/apache/tika/commit/25cee54499126de2b90f6bd5bde8de470b422349

Attached you can find my testfile.

This is the output when running 1.13-SNAPSHOT as a jar:
{code}
java -jar tika-app-1.13-SNAPSHOT.jar iae.xlsx


apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * \(#,##0.00\);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * \(#,##0.00\)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
WARNING: Invalid format: "_([$Ç-2]\ * "-"??_);"
java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * "-"??_)'
        at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
        at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
        at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
        at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
        at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
        at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
        at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
        at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
        at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
        at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
        at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
        at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
        at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
        at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2016-04-11T13:45:08Z"/>
<meta name="extended-properties:AppVersion" content="15.0300"/>
<meta name="dc:creator" content="nick"/>
<meta name="extended-properties:Company" content=""/>
<meta name="dcterms:created" content="2016-01-05T14:53:37Z"/>
<meta name="Last-Modified" content="2016-04-11T13:45:08Z"/>
<meta name="dcterms:modified" content="2016-04-11T13:45:08Z"/>
<meta name="Last-Save-Date" content="2016-04-11T13:45:08Z"/>
<meta name="protected" content="false"/>
<meta name="meta:save-date" content="2016-04-11T13:45:08Z"/>
<meta name="Application-Name" content="Microsoft Excel"/>
<meta name="modified" content="2016-04-11T13:45:08Z"/>
<meta name="Content-Length" content="9119"/>
<meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
<meta name="creator" content="nick"/>
<meta name="meta:author" content="nick"/>
<meta name="meta:creation-date" content="2016-01-05T14:53:37Z"/>
<meta name="extended-properties:Application" content="Microsoft Excel"/>
<meta name="meta:last-author" content="Sam"/>
<meta name="Creation-Date" content="2016-01-05T14:53:37Z"/>
<meta name="resourceName" content="iae.xlsx"/>
<meta name="Last-Author" content="Sam"/>
<meta name="Application-Version" content="15.0300"/>
<meta name="Author" content="nick"/>
<meta name="publisher" content=""/>
<meta name="dc:publisher" content=""/>
<title/>
</head>
<body><div><h1>Sheet1</h1>
<table><tbody><tr>      <td>69.99</td></tr>
</tbody></table>
</div>
</body></html>
{code}

The real output is consistent with what I would expect (and with the output from version 1.12).

I would expect this exception to be handled some other way, rather than showing up (as text) in my parsed output.


> IllegalArgumentException stacktrace in output since POI update
> --------------------------------------------------------------
>
>                 Key: TIKA-1947
>                 URL: https://issues.apache.org/jira/browse/TIKA-1947
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.13
>            Reporter: Sam H
>         Attachments: iae.xlsx
>
>
> I tried parsing an Excel document, and noticed there was an IllegalArgumentException stacktrace in the output.
> I've traced this back to https://github.com/apache/tika/commit/25cee54499126de2b90f6bd5bde8de470b422349
> Attached you can find my testfile: iae.xlsx
> This is the output, running 1.13-snapshot as jar
> {code}
> java -jar tika-app-1.13-SNAPSHOT.jar iae.xlsx
> apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
> WARNING: Invalid format: "_([$Ç-2]\ * #,##0.00_);"
> java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * #,##0.00_)'
>         at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
>         at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
>         at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
>         at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
>         at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
>         at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
>         at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
> WARNING: Invalid format: "_([$Ç-2]\ * \(#,##0.00\);"
> java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * \(#,##0.00\)'
>         at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
>         at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
>         at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
>         at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
>         at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
>         at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
>         at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> apr 11, 2016 3:56:26 PM org.apache.poi.ss.format.CellFormat <init>
> WARNING: Invalid format: "_([$Ç-2]\ * "-"??_);"
> java.lang.IllegalArgumentException: Unsupported [] format block '[' in '_([$Ç-2]\ * "-"??_)'
>         at org.apache.poi.ss.format.CellFormatPart.formatType(CellFormatPart.java:362)
>         at org.apache.poi.ss.format.CellFormatPart.getCellFormatType(CellFormatPart.java:276)
>         at org.apache.poi.ss.format.CellFormatPart.<init>(CellFormatPart.java:180)
>         at org.apache.poi.ss.format.CellFormat.<init>(CellFormat.java:167)
>         at org.apache.poi.ss.format.CellFormat.getInstance(CellFormat.java:143)
>         at org.apache.poi.ss.usermodel.DataFormatter.getFormat(DataFormatter.java:314)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:797)
>         at org.apache.poi.ss.usermodel.DataFormatter.formatRawCellContents(DataFormatter.java:769)
>         at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:354)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:361)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(Unknown Source)
>         at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(Unknown Source)
>         at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:197)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:138)
>         at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:110)
>         at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:97)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:112)
>         at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:87)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>         at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>         at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:190)
>         at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:491)
>         at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:144)
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="date" content="2016-04-11T13:45:08Z"/>
> <meta name="extended-properties:AppVersion" content="15.0300"/>
> <meta name="dc:creator" content="nick"/>
> <meta name="extended-properties:Company" content=""/>
> <meta name="dcterms:created" content="2016-01-05T14:53:37Z"/>
> <meta name="Last-Modified" content="2016-04-11T13:45:08Z"/>
> <meta name="dcterms:modified" content="2016-04-11T13:45:08Z"/>
> <meta name="Last-Save-Date" content="2016-04-11T13:45:08Z"/>
> <meta name="protected" content="false"/>
> <meta name="meta:save-date" content="2016-04-11T13:45:08Z"/>
> <meta name="Application-Name" content="Microsoft Excel"/>
> <meta name="modified" content="2016-04-11T13:45:08Z"/>
> <meta name="Content-Length" content="9119"/>
> <meta name="Content-Type" content="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.microsoft.ooxml.OOXMLParser"/>
> <meta name="creator" content="nick"/>
> <meta name="meta:author" content="nick"/>
> <meta name="meta:creation-date" content="2016-01-05T14:53:37Z"/>
> <meta name="extended-properties:Application" content="Microsoft Excel"/>
> <meta name="meta:last-author" content="Sam"/>
> <meta name="Creation-Date" content="2016-01-05T14:53:37Z"/>
> <meta name="resourceName" content="iae.xlsx"/>
> <meta name="Last-Author" content="Sam"/>
> <meta name="Application-Version" content="15.0300"/>
> <meta name="Author" content="nick"/>
> <meta name="publisher" content=""/>
> <meta name="dc:publisher" content=""/>
> <title/>
> </head>
> <body><div><h1>Sheet1</h1>
> <table><tbody><tr>      <td>69.99</td></tr>
> </tbody></table>
> </div>
> </body></html>
> {code}
> The actual parsed output (the XHTML above) is what I would expect, and it matches the output from version 1.12.
> I would expect this exception to be handled some other way, and not to show up (as text) in my parsed output.
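
A minimal sketch of one possible caller-side workaround, not a fix for the underlying issue. It assumes the warning is emitted through java.util.logging (the timestamp and "WARNING:" header in the output above match the JDK's default log layout) and that the logger is named after the class shown in that header, org.apache.poi.ss.format.CellFormat. Both points are inferred from the log output rather than confirmed against the POI source, and the class name QuietCellFormatLogging is hypothetical.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class QuietCellFormatLogging {

    // Assumption: the logger name equals the class name printed in the log header.
    // A strong reference is kept because java.util.logging holds loggers weakly;
    // without it the configured level could be lost if the logger is collected.
    private static final Logger CELL_FORMAT_LOGGER =
            Logger.getLogger("org.apache.poi.ss.format.CellFormat");

    /** Turns off all records from the assumed CellFormat logger. */
    public static void silence() {
        CELL_FORMAT_LOGGER.setLevel(Level.OFF);
    }

    public static void main(String[] args) {
        silence();
        // With the level set to OFF this record is dropped before any handler sees it.
        CELL_FORMAT_LOGGER.warning("Invalid format: (sample message)");
        System.out.println("WARNING loggable: "
                + CELL_FORMAT_LOGGER.isLoggable(Level.WARNING));
    }
}
```

Calling silence() before parsing would only hide the log record; the format string would still be rejected by the formatter, so this does not change how the cell value itself is rendered.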



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
