Date: Wed, 4 Sep 2019 15:11:00 +0000 (UTC)
From: "Tim Allison (Jira)"
To: dev@tika.apache.org
Reply-To: dev@tika.apache.org
Mailing-List: contact dev-help@tika.apache.org; run by ezmlm
Subject: [jira] [Commented] (TIKA-2934) OOXML parser fails to parse XLSX files with missing cellRef properties


    [ https://issues.apache.org/jira/browse/TIKA-2934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16922591#comment-16922591 ]

Tim Allison commented on TIKA-2934:
-----------------------------------

Thank you for opening this issue. Would you be able to share a triggering file with us?

> OOXML parser fails to parse XLSX files with missing cellRef properties
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2934
>                 URL: https://issues.apache.org/jira/browse/TIKA-2934
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.22
>            Reporter: Yahav Amsalem
>            Priority: Major
>
> A NullPointerException is thrown when parsing xlsx documents that don’t have a CellRef property:
> {code:java}
> Caused by: java.lang.NullPointerException: null
>     at org.apache.poi.util.StringUtil.endsWithIgnoreCase(StringUtil.java:317)
>     at org.apache.poi.ss.util.CellReference.<init>(CellReference.java:109)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$SheetTextAsHTML.cell(XSSFExcelExtractorDecorator.java:452)
>     at org.apache.poi.xssf.eventusermodel.XSSFSheetXMLHandler.endElement(XSSFSheetXMLHandler.java:379)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator$XSSFSheetInterestingPartsCapturer.endElement(XSSFExcelExtractorDecorator.java:553)
>     at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
>     at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanEndElement(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>     at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:452)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.processSheet(XSSFExcelExtractorDecorator.java:352)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.buildXHTML(XSSFExcelExtractorDecorator.java:168)
>     at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:136)
>     at org.apache.tika.parser.microsoft.ooxml.XSSFExcelExtractorDecorator.getXHTML(XSSFExcelExtractorDecorator.java:122)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:201)
>     at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:110)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> {code}
> According to the latest OOXML standard, ECMA-376 5th edition Part 1 (released in December 2016), the Cell Reference (18.18.7, ST_CellRef) property on a Cell (18.3.1.4, CT_Cell) is optional.
> Actually, we believe an abandoned pull request was supposed to fix this issue, but it was never merged: [https://github.com/apache/tika/pull/214/commits/d79aa3baf33d4f859e4daa8ef251721f3ac2a198]. Look at the safety block commented with:
> {code:java}
> // gracefully handle missing CellRef here in a similar way as XSSFCell does{code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
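
For the missing-CellRef case described in the issue, one possible null-safe fallback can be sketched as follows. This is only an illustration under stated assumptions: the class `CellRefFallback` and its methods are hypothetical and are not part of Tika or POI, and the sketch assumes the sheet handler can supply zero-based row and column counters for the current cell. It is not the code from the referenced pull request.

```java
// Hypothetical helper (not Tika or POI API): builds an A1-style reference
// when the cell's "r" attribute is absent, instead of passing null onward.
final class CellRefFallback {

    /** Converts a zero-based column index to letters (0 -> A, 25 -> Z, 26 -> AA). */
    static String columnLetters(int colIdx) {
        if (colIdx < 0) {
            throw new IllegalArgumentException("column index must be >= 0");
        }
        StringBuilder sb = new StringBuilder();
        int n = colIdx + 1; // one-based, so bijective base-26 works out
        while (n > 0) {
            int rem = (n - 1) % 26;
            sb.append((char) ('A' + rem));
            n = (n - 1) / 26;
        }
        return sb.reverse().toString();
    }

    /**
     * Returns the given reference if present; otherwise synthesizes one from
     * the assumed zero-based row and column counters.
     */
    static String resolve(String cellRef, int rowIdx, int colIdx) {
        if (cellRef != null && !cellRef.isEmpty()) {
            return cellRef;
        }
        return columnLetters(colIdx) + (rowIdx + 1);
    }

    public static void main(String[] args) {
        System.out.println(resolve("B7", 0, 0));  // prints "B7"
        System.out.println(resolve(null, 2, 27)); // prints "AB3"
    }
}
```

A caller would resolve the reference before constructing anything that requires a non-null string, so that a sheet without cell references degrades to positional addressing rather than failing the whole parse.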