tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-674) CompositeParser should indicate which parser was actually selected for parsing
Date Sat, 30 Aug 2014 19:59:53 GMT

     [ https://issues.apache.org/jira/browse/TIKA-674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-674:
-----------------------------------

    Fix Version/s: 1.6

> CompositeParser should indicate which parser was actually selected for parsing
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-674
>                 URL: https://issues.apache.org/jira/browse/TIKA-674
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.10
>            Reporter: Andrzej Bialecki 
>            Assignee: Chris A. Mattmann
>             Fix For: 1.6
>
>
> If multiple parsers support the same MIME type and AutoDetectParser (or another
> CompositeParser) is used, the parse output does not indicate which of the alternative
> parsers was actually used. I think that the name of the parser (FQCN?) should be added
> to the metadata.
> Something like this trivial patch:
> {code}
> Index: tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java
> ===================================================================
> --- tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java	(revision 1135167)
> +++ tika-core/src/main/java/org/apache/tika/parser/CompositeParser.java	(working copy)
> @@ -238,6 +238,7 @@
>          try {
>              TikaInputStream taggedStream = TikaInputStream.get(stream, tmp);
>              TaggedContentHandler taggedHandler = new TaggedContentHandler(handler);
> +            metadata.add("X-Parsed-By", parser.getClass().getName());
>              try {
>                  parser.parse(taggedStream, taggedHandler, metadata, context);
>              } catch (RuntimeException e) {
> {code}
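
For illustration only, here is a minimal, self-contained sketch of the idea behind the patch above: a composite parser records the class name of the delegate it selects under the "X-Parsed-By" metadata key before handing off to it. Everything except the key name is an assumption made for this sketch. SketchParser, PlainTextSketchParser and CompositeSketchParser are hypothetical stand-ins, not Tika's Parser, CompositeParser or Metadata classes, and the metadata is modelled as a plain multi-valued map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser interface; this is NOT Tika's Parser API.
interface SketchParser {
    boolean supports(String mimeType);

    void parse(String mimeType, String input, Map<String, List<String>> metadata);
}

// Hypothetical leaf parser that claims text/plain and does no real work.
class PlainTextSketchParser implements SketchParser {
    @Override
    public boolean supports(String mimeType) {
        return "text/plain".equals(mimeType);
    }

    @Override
    public void parse(String mimeType, String input, Map<String, List<String>> metadata) {
        // A real parser would emit content here; the sketch only cares about metadata.
    }
}

// Hypothetical composite: picks the first delegate that supports the type and
// records that delegate's fully qualified class name before delegating.
class CompositeSketchParser implements SketchParser {
    static final String PARSED_BY = "X-Parsed-By"; // key name taken from the patch

    private final List<SketchParser> delegates;

    CompositeSketchParser(List<SketchParser> delegates) {
        this.delegates = new ArrayList<>(delegates);
    }

    @Override
    public boolean supports(String mimeType) {
        for (SketchParser parser : delegates) {
            if (parser.supports(mimeType)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void parse(String mimeType, String input, Map<String, List<String>> metadata) {
        for (SketchParser parser : delegates) {
            if (parser.supports(mimeType)) {
                // Append rather than overwrite, so nested composites each leave an entry.
                metadata.computeIfAbsent(PARSED_BY, key -> new ArrayList<>())
                        .add(parser.getClass().getName());
                parser.parse(mimeType, input, metadata);
                return;
            }
        }
        throw new IllegalArgumentException("No parser supports " + mimeType);
    }
}
```

Because the sketch appends values, wrapping one composite in another yields one "X-Parsed-By" entry per level, outermost first. Whether that is preferable to recording only the final leaf parser is a design choice this sketch does not settle.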



--
This message was sent by Atlassian JIRA
(v6.2#6252)
