tika-dev mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2309) New Detector and Parser classes for Time Stamped Data Envelope file format
Date Tue, 04 Apr 2017 14:38:42 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955205#comment-15955205 ]

ASF GitHub Bot commented on TIKA-2309:

tballison commented on issue #161: fix for TIKA-2309 contributed by Shinobi@75
URL: https://github.com/apache/tika/pull/161#issuecomment-291520466
   If I understand correctly, the TSD is an envelope file that contains another actual file.
 For example, your first test file had the TSD envelope, but then it contained an XML file:
   `<?xml version="1.0" encoding="UTF-8"?>
       <manifest ID="9570">
           <man:manifestConservazione Id="ManifestConservazioneCNN"`
   I see that your updated .pdf file also has an envelope and then the raw bytes for a PDF.
   You'll probably want to cache those bytes in a byte[] and then call the embedded parser,
something like:
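   embeddedDocumentExtractor = EmbeddedDocumentUtil.getEmbeddedDocumentExtractor(context);
   // Hand the enveloped bytes to the embedded parser only if the extractor wants them
   if (embeddedDocumentExtractor.shouldParseEmbedded(embeddedMetadata)) {
           TikaInputStream stream = TikaInputStream.get(cachedBytes);
           try {
                   embeddedDocumentExtractor.parseEmbedded(
                           stream,
                           new EmbeddedContentHandler(xhtml),
                           embeddedMetadata, false);
           } finally {
                   stream.close();
           }
   }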
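
   For the "cache those bytes in a byte[]" step, a minimal JDK-only sketch might look
like the following. It assumes the enveloped payload is already available as an
InputStream and is small enough to hold in memory; the class name `EnvelopeBytes` and
the method `readAll` are hypothetical and not part of Tika.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of Tika: buffers a payload stream in memory.
final class EnvelopeBytes {

    private EnvelopeBytes() {
    }

    /**
     * Reads everything that is left in the stream into a byte[] so the
     * content can be re-read later (for example by an embedded parser).
     * The caller remains responsible for closing the stream.
     */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        // read() returns -1 once the end of the stream is reached
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }
}
```

   The resulting array plays the role of `cachedBytes` in the snippet above. On Java 9
or later, `InputStream.readAllBytes()` does the same job; for very large payloads a
temporary file would probably be a safer choice than an in-memory buffer.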

> New Detector and Parser classes for Time Stamped Data Envelope file format
> --------------------------------------------------------------------------
>                 Key: TIKA-2309
>                 URL: https://issues.apache.org/jira/browse/TIKA-2309
>             Project: Tika
>          Issue Type: Improvement
>          Components: detector, parser
>    Affects Versions: 1.13, 1.14
>            Reporter: Fabio
>            Priority: Minor
>         Attachments: MANIFEST.XML.TSD
> Hello,
> I'm Fabio Evangelista from Rome. I'm working for an Italian Public Administration company
and I'm using Apache Tika in my Java applications to detect and parse a broad range of file
formats. During that activity, after following your good guide on the Tika project page, I've
successfully written new Detector and Parser classes for a particular crypto timestamp
type with these characteristics:
> Format name:               Time Stamped Data Envelope
> Mime Type:                   application/timestamped-data
> File extension:              .tsd
> TSD file hex magic code at the start of the file:   30 80 06 0B 2A 86 48 86 F7
> I've successfully integrated and tested those new classes with my applications in Tika
1.13 tika-core.jar and tika-parsers.jar. What should I do to submit my new classes to you?
Should I push them to a particular git branch, or is there a particular process to follow
to submit my classes?
> Thank you for your patience and best regards.
> Fabio.
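
The magic-byte check described in the quoted issue can be sketched with the JDK alone.
This is only an illustration, not the contributed Detector: the class name `TsdMagic`
and the method `looksLikeTsd` are hypothetical. The byte prefix is the one quoted in
the issue, and the mark/reset handling follows the usual expectation that a detector
leaves the stream positioned where it found it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Tika: checks a stream for the TSD prefix.
final class TsdMagic {

    // Prefix quoted in the issue: 30 80 06 0B 2A 86 48 86 F7
    private static final byte[] MAGIC = {
        0x30, (byte) 0x80, 0x06, 0x0B, 0x2A,
        (byte) 0x86, 0x48, (byte) 0x86, (byte) 0xF7
    };

    private TsdMagic() {
    }

    /**
     * Returns true if the stream starts with the TSD prefix. The stream must
     * support mark/reset; it is reset to its original position before returning.
     */
    static boolean looksLikeTsd(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(MAGIC.length);
        try {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            // A single read() may return fewer bytes than requested, so loop
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n == -1) {
                    break; // stream is shorter than the prefix
                }
                read += n;
            }
            return read == MAGIC.length && Arrays.equals(head, MAGIC);
        } finally {
            in.reset();
        }
    }
}
```

A caller could wrap a plain file stream in a `BufferedInputStream` to get mark/reset
support before passing it in. In an actual Tika Detector, a positive match would
presumably be reported as the `application/timestamped-data` media type named in the
issue.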

