tika-dev mailing list archives

From "Slava G (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-2727) Parsing and detect mime type of XML file stuck in infinite loop
Date Mon, 24 Sep 2018 20:35:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626408#comment-16626408 ]

Slava G edited comment on TIKA-2727 at 9/24/18 8:34 PM:
--------------------------------------------------------

I tried to reproduce this. After a few hundred XML files had been passed to Tika for parsing, it hung.

I then processed the same file in a loop, using a file name without the .xml extension.

10 times it was parsed as text/plain.

On the 11th iteration it got stuck:

at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
 at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
 at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
 at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371)
 at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:53)
 at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:44)
 at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
 at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:493)
 at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)
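
The loop described above can be sketched with JDK classes only. This is a minimal sketch, not the code actually used for the reproduction: the class name DetectLoopRepro, the method firstStuckIteration, and the stand-in detector are all hypothetical. Each call runs on a daemon worker thread with a timeout, so a hang shows up as a reported iteration number instead of a frozen process.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Function;

public class DetectLoopRepro {

    /**
     * Runs the detector repeatedly on the same bytes and returns the 1-based
     * iteration that did not finish within timeoutMillis, or -1 if all finished.
     */
    static int firstStuckIteration(Function<byte[], String> detector, byte[] data,
                                   int iterations, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "detect-worker");
            t.setDaemon(true); // a worker that never returns must not keep the JVM alive
            return t;
        });
        try {
            for (int i = 1; i <= iterations; i++) {
                Callable<String> task = () -> detector.apply(data);
                Future<String> result = pool.submit(task);
                try {
                    String type = result.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    System.out.println(i + ": " + type);
                } catch (TimeoutException e) {
                    result.cancel(true); // best effort; a spinning parser may ignore interrupts
                    return i;
                }
            }
            return -1;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder input; the real test would read the attached XML file.
        byte[] data = "<root attr=\"value\"/>".getBytes(StandardCharsets.UTF_8);
        // Stand-in detector (assumption). With Tika on the classpath this could be
        // bytes -> new org.apache.tika.Tika().detect(bytes)
        Function<byte[], String> detector = bytes -> "text/plain";
        System.out.println("stuck at iteration: "
                + firstStuckIteration(detector, data, 20, 5_000));
    }
}
```

The method returns as soon as one iteration times out, because later tasks would only queue behind the stuck worker.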


was (Author: slavago):
I tried to reproduce this. After a few hundred XML files had been passed to Tika for parsing, it hung:

at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
 at org.apache.xerces.impl.XMLEntityScanner.scanLiteral(Unknown Source)
 at org.apache.xerces.impl.XMLScanner.scanAttributeValue(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanAttribute(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl.scanStartElement(Unknown Source)
 at org.apache.xerces.impl.XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
 at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
 at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
 at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
 at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
 at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
 at org.apache.tika.utils.XMLReaderUtils.parseSAX(XMLReaderUtils.java:371)
 at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:53)
 at org.apache.tika.detect.XmlRootExtractor.extractRootElement(XmlRootExtractor.java:44)
 at org.apache.tika.mime.MimeTypes.getMimeType(MimeTypes.java:212)
 at org.apache.tika.mime.MimeTypes.detect(MimeTypes.java:493)
 at org.apache.tika.detect.CompositeDetector.detect(CompositeDetector.java:84)

> Parsing and detect mime type of XML file stuck in infinite loop
> ---------------------------------------------------------------
>
>                 Key: TIKA-2727
>                 URL: https://issues.apache.org/jira/browse/TIKA-2727
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.17
>            Reporter: Slava G
>            Assignee: Tim Allison
>            Priority: Major
>             Fix For: 1.19, 2.0.0
>
>         Attachments: 1_e3e13f0e-7085-4000-a558-5d255ed7a944.xml
>
>
> Hi,
> I'm trying to parse (or even just detect the MIME type of) an XML file that is not large but is somewhat tricky, and my process hangs at:
> XMLStringBuffer.append(char[], int, int) line: not available 
> XMLStringBuffer.append(XMLString) line: not available 
> XMLNSDocumentScannerImpl(XMLScanner).scanAttributeValue(XMLString, XMLString, String, boolean, String) line: not available
> XMLNSDocumentScannerImpl.scanAttribute(XMLAttributesImpl) line: not available
> XMLNSDocumentScannerImpl.scanStartElement() line: not available
> XMLNSDocumentScannerImpl$NSContentDispatcher.scanRootElementHook() line: not available
> XMLNSDocumentScannerImpl$NSContentDispatcher(XMLDocumentFragmentScannerImpl$FragmentContentDispatcher).dispatch(boolean) line: not available
> XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available
> XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available
> XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available
> SAXParserImpl$JAXPSAXParser(XMLParser).parse(XMLInputSource) line: not available
> SAXParserImpl$JAXPSAXParser(AbstractSAXParser).parse(InputSource) line: not available
> SAXParserImpl$JAXPSAXParser.parse(InputSource) line: not available 
> SAXParserImpl.parse(InputSource, DefaultHandler) line: not available 
> SAXParserImpl(SAXParser).parse(InputStream, DefaultHandler) line: 195 
> XmlRootExtractor.extractRootElement(InputStream) line: 62 
> XmlRootExtractor.extractRootElement(byte[]) line: 42 
> MimeTypes.getMimeType(byte[]) line: 212 
> MimeTypes.detect(InputStream, Metadata) line: 494 
> DefaultDetector(CompositeDetector).detect(InputStream, Metadata) line: 84
>  
> Please see attached XML file.
> Please advise.
> Thanks
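
For context, the trace in the report goes from XmlRootExtractor.extractRootElement into SAXParser.parse and stalls in scanAttributeValue, that is, while the parser is still reading an attribute on the root element's start tag. The sketch below shows that general style of root-element sniffing using only JDK classes. It is an illustration under assumptions, not Tika's implementation: the class RootElementSketch and the method extractRoot are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.namespace.QName;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class RootElementSketch {

    /** Records the first element seen, then aborts the parse. */
    static final class RootHandler extends DefaultHandler {
        QName root; // stays null if no start tag was ever reported

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts)
                throws SAXException {
            // The parser calls this only after it has scanned the complete start
            // tag, including every attribute value.
            root = new QName(uri, local);
            throw new SAXException("root element found, stopping parse");
        }
    }

    /** Returns the root element's name, or null if the input is not parseable XML. */
    static QName extractRoot(byte[] data) {
        RootHandler handler = new RootHandler();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            SAXParser parser = factory.newSAXParser();
            parser.parse(new ByteArrayInputStream(data), handler);
        } catch (Exception e) {
            // Either the deliberate abort above or a genuine parse error;
            // handler.root distinguishes the two cases.
        }
        return handler.root;
    }

    public static void main(String[] args) {
        byte[] xml = "<a:doc xmlns:a=\"urn:example\" id=\"1\"><b/></a:doc>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(extractRoot(xml));
        System.out.println(extractRoot("plain text".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Because the handler gets control only once the start tag is complete, aborting in startElement cannot interrupt a parser that is still inside attribute scanning, which is where both traces in this thread are stuck.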



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
