tika-dev mailing list archives

From "Md (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (TIKA-2900) Removing comments from *.docx, *.pdf files
Date Mon, 08 Jul 2019 16:16:00 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Md updated TIKA-2900:
---------------------
    Description: 
Hello,

I use Apache Tika to extract text, mostly from *.doc, *.docx, and *.pdf files. Sometimes a file
contains comments, and Tika extracts them and appends them to the end of the output.
Is there a way to exclude comments during text extraction?

Here is the code I am using:

{code:java}
        StringBuilder fileContent = new StringBuilder();
        Parser parser = new AutoDetectParser();
        ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
                -1);
        //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
        RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
        Metadata metadata = new Metadata();

        ParseContext parseContext = new ParseContext();
        OfficeParserConfig officeParserConfig = new OfficeParserConfig();
        officeParserConfig.setUseSAXDocxExtractor(true);
        officeParserConfig.setIncludeDeletedContent(false);
        officeParserConfig.setIncludeMoveFromContent(false);
        officeParserConfig.setIncludeHeadersAndFooters(false);
        parseContext.set(OfficeParserConfig.class, officeParserConfig);

        wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
        String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
        {code}

  was:
Hello,

I use Apache Tika to extract text, mostly from *.doc, *.docx, and *.pdf files. Sometimes a file
contains comments, and Tika extracts them and appends them to the end of the output.
Is there a way to exclude comments during text extraction?

Here is the code I am using:

```
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
        -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
```


> Removing comments from *.docx, *.pdf files
> ------------------------------------------
>
>                 Key: TIKA-2900
>                 URL: https://issues.apache.org/jira/browse/TIKA-2900
>             Project: Tika
>          Issue Type: Wish
>          Components: app, example
>    Affects Versions: 1.21
>            Reporter: Md
>            Priority: Major
>
> Hello,
> I use Apache Tika to extract text, mostly from *.doc, *.docx, and *.pdf files. Sometimes a file
> contains comments, and Tika extracts them and appends them to the end of the output.
> Is there a way to exclude comments during text extraction?
> Here is the code I am using:
> {code:java}
>         StringBuilder fileContent = new StringBuilder();
>         Parser parser = new AutoDetectParser();
>         ContentHandlerFactory factory = new BasicContentHandlerFactory(BasicContentHandlerFactory.HANDLER_TYPE.HTML,
>                 -1);
>         //InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
>         RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
>         Metadata metadata = new Metadata();
>         ParseContext parseContext = new ParseContext();
>         OfficeParserConfig officeParserConfig = new OfficeParserConfig();
>         officeParserConfig.setUseSAXDocxExtractor(true);
>         officeParserConfig.setIncludeDeletedContent(false);
>         officeParserConfig.setIncludeMoveFromContent(false);
>         officeParserConfig.setIncludeHeadersAndFooters(false);
>         parseContext.set(OfficeParserConfig.class, officeParserConfig);
>         wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
>         String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
>         {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
