tika-dev mailing list archives

From "Md (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2901) Tika extracting points from Chart
Date Mon, 08 Jul 2019 16:37:00 GMT
Md created TIKA-2901:
------------------------

             Summary: Tika extracting points from Chart 
                 Key: TIKA-2901
                 URL: https://issues.apache.org/jira/browse/TIKA-2901
             Project: Tika
          Issue Type: Bug
          Components: app
    Affects Versions: 1.21
            Reporter: Md


I am using Tika to extract content from *.docx and other files. I have noticed that Tika extracts
the data points from charts and places them at the end of the extracted content.
I am using the following code for extraction:
{code:java}
        StringBuilder fileContent = new StringBuilder();
        Parser parser = new AutoDetectParser();
        // HTML output for each document; -1 disables the write limit.
        ContentHandlerFactory factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
        // inputStream must be opened before this point, e.g.:
        //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
        Metadata metadata = new Metadata();

        // Office settings: SAX-based docx extractor; skip deleted text,
        // move-from text, and headers/footers.
        ParseContext parseContext = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        officeParserConfig.setIncludeDeletedContent(false);
        officeParserConfig.setIncludeMoveFromContent(false);
        officeParserConfig.setIncludeHeadersAndFooters(false);
        parseContext.set(OfficeParserConfig.class, officeParserConfig);

        // The wrapper stores the extracted content in the metadata under TIKA_CONTENT.
        wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
        String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
        {code}

Please find attached the input file and the output from Tika.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
