tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1715) Save embedded images into another location
Date Mon, 24 Aug 2015 18:52:45 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14709848#comment-14709848 ]

Tim Allison commented on TIKA-1715:
-----------------------------------

If this is a usage question, probably better to ask on user@tika.apache.org.

The RecursiveParserWrapper is only to be used for extraction of text content and metadata,
not actual bytes.

To see an example of how to extract the bytes of embedded files, see this [example|https://svn.apache.org/viewvc/tika/trunk/tika-example/src/main/java/org/apache/tika/example/ExtractEmbeddedFiles.java?revision=1696751&view=markup].

Note that the caveat on TIKA-1674 still applies: this approach only extracts the bytes of
the immediate children of the main document.  It will not pull out grandchildren or any
deeper descendants.  This is the current behavior of tika-app.jar's -z option and
tika-server's /unpack endpoint.
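
For illustration only, here is a minimal sketch of the file-writing step behind that kind of extraction: copying the bytes of one embedded resource into a directory of your choice. It uses only the JDK. The names EmbeddedSaver and save are hypothetical and are not part of Tika. The assumption is that you would call something like it from the parseEmbedded(...) method of a custom EmbeddedDocumentExtractor set on the ParseContext, roughly as the linked example does; that wiring is not shown here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper (not part of Tika): writes one embedded resource into a target directory. */
final class EmbeddedSaver {

    // Used to invent a file name when the embedded resource does not report one.
    private static final AtomicInteger COUNTER = new AtomicInteger();

    /**
     * Copies the stream into outputDir. Only the last path segment of the
     * reported name is kept, so a name such as "../../x.png" cannot escape
     * outputDir.
     */
    static Path save(InputStream stream, String reportedName, Path outputDir) throws IOException {
        String name = (reportedName == null) ? "" : reportedName.replace('\\', '/');
        name = name.substring(name.lastIndexOf('/') + 1);
        if (name.isEmpty() || name.equals(".") || name.equals("..")) {
            name = "embedded-" + COUNTER.getAndIncrement();
        }
        Files.createDirectories(outputDir);
        Path target = outputDir.resolve(name);
        // Files.copy refuses to overwrite: it throws FileAlreadyExistsException instead.
        Files.copy(stream, target);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the bytes of an embedded image handed over by the parser.
        Path dir = Files.createTempDirectory("embedded");
        Path saved = save(new ByteArrayInputStream(new byte[] {1, 2, 3}), "image0.png", dir);
        System.out.println(saved + " (" + Files.size(saved) + " bytes)");
    }
}
```

Keeping only the last path segment is a deliberate choice in this sketch: the name reported for an embedded file comes from the document itself, so it is not necessarily a valid or safe path on the local file system.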

As for the speed, yes, it can be slow on some files.  That's why we chose not to extract inline
images by default.  If you are finding better performance with PDFBox's ExtractImages, let
us know!



> Save embedded images into another location
> ------------------------------------------
>
>                 Key: TIKA-1715
>                 URL: https://issues.apache.org/jira/browse/TIKA-1715
>             Project: Tika
>          Issue Type: Test
>          Components: metadata
>    Affects Versions: 1.10
>            Reporter: Damiano
>              Labels: newbie
>
> Hello,
> I am having a strange problem dealing with embedded images.
> This is my code:
> {code:java}
>     public void getImages() throws IOException, TikaException, SAXException {
>         
>         try (InputStream stream = new FileInputStream(this.fileName)) {
>             RecursiveParserWrapper p = new RecursiveParserWrapper(
>                 new AutoDetectParser(),
>                 new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.IGNORE, -1)
>             );            
>             
>             ParseContext context = new ParseContext();
>             PDFParserConfig config = new PDFParserConfig();
>             config.setExtractInlineImages(true);
>             config.setExtractUniqueInlineImagesOnly(true);
>             context.set(org.apache.tika.parser.pdf.PDFParserConfig.class, config);
>             context.set(org.apache.tika.parser.Parser.class, p);            
>             
>             p.parse(stream, new BodyContentHandler(-1), new Metadata(), context);
>             
>             List<Metadata> metadatas = p.getMetadata();
>                         
>             FileInputStream f = new FileInputStream("/tmp/" + metadatas.get(1).get("File Name"));
>             //FileInputStream f = new FileInputStream(metadatas.get(1).get("File Name"));
>             
>             System.out.println(f.available());
>         }
>     }
> {code}
> I can get the name of the embedded images with get("File Name") but the path seems invalid.
> I need to save all the embedded images (inline images) to another location.
> Thank you in advance!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
