[ https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Md updated TIKA-2900:
---------------------
Description:
Hello,
I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Some files contain
comments, and Tika extracts them and appends them to the end of the extracted text.
Is there a way to exclude comments during text extraction?
Here is the code I am using:
{code:java}
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
-1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
{code}
was:
Hello,
I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Some files contain
comments, and Tika extracts them and appends them to the end of the extracted text.
Is there a way to exclude comments during text extraction?
Here is the code I am using:
```
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
-1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);
wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```
> Removing comments from *.docx, *.pdf files
> ------------------------------------------
>
> Key: TIKA-2900
> URL: https://issues.apache.org/jira/browse/TIKA-2900
> Project: Tika
> Issue Type: Wish
> Components: app, example
> Affects Versions: 1.21
> Reporter: Md
> Priority: Major
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Some files contain
> comments, and Tika extracts them and appends them to the end of the extracted text.
> Is there a way to exclude comments during text extraction?
> Here is the code I am using:
> {code:java}
> StringBuilder fileContent = new StringBuilder();
> Parser parser = new AutoDetectParser();
> ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
> -1);
> //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
> RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
> Metadata metadata = new Metadata();
> ParseContext parseContext = new ParseContext();
> OfficeParserConfig officeParserConfig = new OfficeParserConfig();
> officeParserConfig.setUseSAXDocxExtractor(true);
> officeParserConfig.setIncludeDeletedContent(false);
> officeParserConfig.setIncludeMoveFromContent(false);
> officeParserConfig.setIncludeHeadersAndFooters(false);
> parseContext.set(OfficeParserConfig.class, officeParserConfig);
> wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
> String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)