Md created TIKA-2901:
------------------------
Summary: Tika extracting points from Chart
Key: TIKA-2901
URL: https://issues.apache.org/jira/browse/TIKA-2901
Project: Tika
Issue Type: Bug
Components: app
Affects Versions: 1.21
Reporter: Md
I am using Tika to extract content from *.docx and other files. I have noticed that Tika extracts
the data points from charts and appends them to the end of the extracted text.
I am using the following code for extraction:
{code:java}
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory =
        new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
// inputFileName is the path to the document being parsed
InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
{code}
Please find attached the input file and the corresponding output from Tika.
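For anyone reproducing this, it may help to first confirm which parts of the .docx package carry chart data. The following is a minimal sketch that uses only the JDK (no Tika) and is not part of the extraction code above. It assumes the chart definitions and their embedded workbooks sit under the conventional `word/charts/` and `word/embeddings/` folders; the class name `DocxChartParts` and the method `findChartParts` are hypothetical names chosen for illustration. Whether these parts are actually the source of the trailing text is an assumption to verify, not a confirmed cause.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper: lists package parts of a .docx that usually hold chart data.
class DocxChartParts {

    // Returns the names of zip entries under word/charts/ or word/embeddings/,
    // in the order they appear in the archive. These prefixes are the
    // conventional locations, assumed here rather than resolved via relationships.
    static List<String> findChartParts(InputStream docx) throws IOException {
        List<String> hits = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/charts/") || name.startsWith("word/embeddings/")) {
                    hits.add(name);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxChartParts <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            for (String name : findChartParts(in)) {
                System.out.println(name);
            }
        }
    }
}
```

If the listing shows chart or embedded workbook parts, comparing their contents with the text at the end of the Tika output should indicate whether the extra points come from those parts.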