tika-dev mailing list archives

From "Md (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-2900) Removing comments from *.docx, *.pdf files
Date Mon, 08 Jul 2019 16:13:00 GMT
Md created TIKA-2900:
------------------------

             Summary: Removing comments from *.docx, *.pdf files
                 Key: TIKA-2900
                 URL: https://issues.apache.org/jira/browse/TIKA-2900
             Project: Tika
          Issue Type: Wish
          Components: app, example
    Affects Versions: 1.21
            Reporter: Md


Hello,

I use Apache Tika to extract text, mostly from *.doc, *.docx and *.pdf files. Sometimes the
files contain comments, and Tika extracts them and appends them at the end of the extracted text.
Is there a way to exclude comments when extracting text?

 

Here is the code I am using:

```
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
