tika-dev mailing list archives

From "Md (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2901) Tika extracting points data from Chart
Date Mon, 08 Jul 2019 16:45:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Md updated TIKA-2901:
---------------------
    Summary: Tika extracting points data from Chart  (was: Tika extracting points from Chart)

> Tika extracting points data from Chart 
> ---------------------------------------
>
>                 Key: TIKA-2901
>                 URL: https://issues.apache.org/jira/browse/TIKA-2901
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.21
>            Reporter: Md
>            Priority: Major
>         Attachments: Chart_data_sample_text_possible_issue.docx, Chart_data_sample_text_possible_issue.docx.txt
>
>
> I am using Tika to extract content from *.docx and other files. I have noticed that Tika
> extracts the data points from charts and appends them to the end of the extracted text.
> I am using the following code for extraction:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> // A write limit of -1 means the HTML content handler is not truncated
> ContentHandlerFactory factory =
>         new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
> // inputStream is assumed to be opened by the caller, for example:
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> // Configure the OOXML parser: SAX-based docx extraction, with no deleted
> // text, move-from text, or headers and footers
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
> Please find attached the input file and the corresponding output from Tika.
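
To check whether a given .docx actually carries chart parts that could account for the extra text, it can help to look inside the package itself. The following is a minimal sketch using only the JDK, not part of the reported code. It assumes the usual OOXML layout in which Word stores charts under word/charts/. The class name ChartPartLister and the command-line path argument are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class ChartPartLister {

    /** Returns the names of all zip entries stored under word/charts/. */
    static List<String> listChartParts(InputStream docx) throws IOException {
        List<String> parts = new ArrayList<>();
        // A .docx file is a zip package, so its parts can be walked as zip entries
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumption: chart parts live under word/charts/ (typical for Word)
                if (!entry.isDirectory() && name.startsWith("word/charts/")) {
                    parts.add(name);
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is a hypothetical path to the .docx under investigation
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String part : listChartParts(in)) {
                System.out.println(part);
            }
        }
    }
}
```

If the listing is non-empty, the document contains chart parts, and the chart values seen at the end of the extracted text most likely originate there rather than from the main document body.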



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
