tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (TIKA-1764) Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor
Date Fri, 09 Oct 2015 11:57:26 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14950277#comment-14950277 ]

Tim Allison edited comment on TIKA-1764 at 10/9/15 11:57 AM:
-------------------------------------------------------------

Y, I completely agree that we all need to see when embedded documents are failing.  The RecursiveParserWrapper
allowed me to discover TIKA-1651, for example, and I suspect that there are lots of other
discoveries to be made with embedded objects.

I think I now remember why I haven't gotten around to fixing this...

The problem with logging the full metadata at that point in the code is that, when parsing
via the standard AutoDetectParser, the metadata object there holds no information about the
container document.  So, all you'd get would be the detected mime type, the embedded
object's name and any metadata that was pulled out before the parse failed.  In short, without
other changes in our code, there would be no way to link that stacktrace or the metadata back
to the source document with the AutoDetectParser.

I (re)tested this just now to confirm.  I truncated a ppt file and zipped it up.  This is
what I got at that point in the code:
{noformat}
inside parsingEmbdeddedExtractor: date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: embeddedRelationshipId ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: X-Parsed-By ; org.apache.tika.parser.DefaultParser
inside parsingEmbdeddedExtractor: meta:save-date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: resourceName ; testPPT_comment_broken.ppt
inside parsingEmbdeddedExtractor: dcterms:modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Last-Modified ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Length ; 63760
inside parsingEmbdeddedExtractor: Last-Save-Date ; 2015-10-09T11:27:28Z
inside parsingEmbdeddedExtractor: Content-Type ; application/vnd.ms-powerpoint
{noformat}

So, the only way to get the container doc's information would be to cache it as you're parsing
the embedded documents and transmit that information through the ParseContext.  This is exactly
what the RecursiveMetadataParser does, so I'm not sure that we'd want to modify anything within
Tika to solve this problem because I think the existing solution is sufficient.
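
To make the "cache it and pass it through the context" idea concrete, here is a minimal, dependency-free sketch.  None of these classes are Tika classes: {{Context}} is only loosely modelled on the class-keyed lookup that ParseContext offers, and {{ContainerInfo}}, {{EmbeddedHandler}} and the container name {{container.zip}} are hypothetical names made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins, NOT Tika classes: they only illustrate carrying
// container information down to the code that handles embedded documents.
final class ContainerContextSketch {

    /** Plays the role of a parse context: a bag of objects keyed by their class. */
    static final class Context {
        private final Map<Class<?>, Object> values = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            values.put(key, value);
        }

        <T> T get(Class<T> key) {
            return key.cast(values.get(key));
        }
    }

    /** What the caller caches before it starts parsing the container. */
    static final class ContainerInfo {
        final String resourceName;

        ContainerInfo(String resourceName) {
            this.resourceName = resourceName;
        }
    }

    /** Handles one embedded document and records failures together with the container name. */
    static final class EmbeddedHandler {
        final List<String> failures = new ArrayList<>();

        void parseEmbedded(String embeddedName, boolean corrupt, Context context) {
            try {
                if (corrupt) {
                    // Simulated parse failure of the embedded document.
                    throw new IllegalStateException("truncated stream");
                }
            } catch (IllegalStateException e) {
                // The embedded document's own metadata knows nothing about its
                // container, so the link has to come from the context.
                ContainerInfo container = context.get(ContainerInfo.class);
                String owner = container == null ? "<unknown container>" : container.resourceName;
                failures.add(owner + " -> " + embeddedName + ": " + e.getMessage());
            }
        }
    }

    /** Parses one healthy and one corrupt embedded file inside a hypothetical container. */
    static List<String> demo() {
        Context context = new Context();
        context.set(ContainerInfo.class, new ContainerInfo("container.zip"));

        EmbeddedHandler handler = new EmbeddedHandler();
        handler.parseEmbedded("fine.txt", false, context);
        handler.parseEmbedded("testPPT_comment_broken.ppt", true, context);
        return handler.failures;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Without the {{ContainerInfo}} entry in the context, the failure record falls back to {{<unknown container>}}, which is the situation described above for the plain AutoDetectParser.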

If you're using Solr Cell... I opened a ticket a while ago to parameterize the use of the
RecursiveMetadataParser in Solr Cell/DIH (SOLR-7229), but I haven't worked on it at all. 
If you'd like to help on that by giving feedback on what you'd need, I think the Solr community
would be receptive.  We had very quick commits on SOLR-7189 and SOLR-7231.

As a side note, I would very strongly encourage you to support SOLR-7632 and move Tika out
of the same JVM that is sending updates to Solr.  I don't think this should be the default,
but I do think that users should be able to configure the use of tika-server instead of the
current embedded use of Tika.
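
As a rough illustration of what "use tika-server instead" could look like from the client side, the sketch below only builds an HTTP request and does not send it.  It assumes a tika-server instance reachable at {{http://localhost:9998}} and uses its {{/rmeta}} endpoint; the host, the port and the class name {{TikaServerRequestSketch}} are assumptions for the example, and it needs Java 11+ for {{java.net.http}}.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Sketch only: builds the request a client might send to an out-of-process
// tika-server. The base URL is an assumption about the deployment.
final class TikaServerRequestSketch {

    /** Builds (but does not send) a PUT of the raw document bytes to the /rmeta endpoint. */
    static HttpRequest buildRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/rmeta"))
                .header("Accept", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] placeholder = "placeholder bytes".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildRequest("http://localhost:9998", placeholder);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of {{HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())}}.  The point of the design is that a parser crash or out-of-memory error then takes down the server process, not the JVM that is sending updates to Solr.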

Finally, speaking of embedded documents, if you have any friends over on Kite, I'd encourage
them to look at Kite's failure to handle embedded documents [here|https://github.com/kite-sdk/kite/issues/397].
 There's every chance they've fixed this by now, but as of July, no dice.

> Provide information on failed document parsing in ParsingEmbeddedDocumentExtractor
> ----------------------------------------------------------------------------------
>
>                 Key: TIKA-1764
>                 URL: https://issues.apache.org/jira/browse/TIKA-1764
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.5, 1.10
>            Reporter: Odilo Oehmichen
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> The {{ParsingEmbeddedDocumentExtractor}} delegates the parsing of documents to a {{Parser}} instance.
> If this parser fails with a {{TikaException}}, the extractor class returns silently:
> {code}
>  catch (TikaException e) {
>             // TODO: can we log a warning somehow?
>             // Could not parse the entry, just skip the content
>         }
> {code}
> This behaviour makes it very hard to detect problems concerning parsing.
> As the {{TODO}} in the source already states, please add some logging of the exception here.
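
For reference, the kind of change the issue asks for might look roughly like the sketch below.  It is not the actual extractor code: {{ParseFailure}} and {{EmbeddedParser}} are hypothetical stand-ins for {{TikaException}} and the delegate {{Parser}}, and {{java.util.logging}} is used only to keep the example dependency-free, not because Tika logs that way.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Sketch of logging a failed embedded parse instead of returning silently.
// All types here are hypothetical stand-ins, not Tika classes.
final class LogEmbeddedFailureSketch {

    private static final Logger LOG =
            Logger.getLogger(LogEmbeddedFailureSketch.class.getName());

    /** Stand-in for TikaException. */
    static final class ParseFailure extends Exception {
        ParseFailure(String message) {
            super(message);
        }
    }

    /** Stand-in for the delegate parser. */
    interface EmbeddedParser {
        void parse(String resourceName) throws ParseFailure;
    }

    /**
     * Returns true if the embedded document parsed. On failure it logs a
     * warning with the stack trace and skips the content, as before.
     */
    static boolean parseEmbedded(EmbeddedParser delegate, String resourceName) {
        try {
            delegate.parse(resourceName);
            return true;
        } catch (ParseFailure e) {
            LOG.log(Level.WARNING, "Could not parse embedded document " + resourceName, e);
            return false;
        }
    }

    public static void main(String[] args) {
        boolean parsed = parseEmbedded(
                name -> { throw new ParseFailure("truncated stream"); },
                "testPPT_comment_broken.ppt");
        System.out.println("parsed: " + parsed);
    }
}
```

Note that a warning like this can only name the embedded resource; tying it back to the container document still requires passing that information in, as discussed in the comment above.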



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
