tika-dev mailing list archives

From "Thomas (Jira)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2932) Filter Documents Meta Data
Date Mon, 02 Sep 2019 04:03:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16920603#comment-16920603 ]

Thomas commented on TIKA-2932:
------------------------------

[~tallison@apache.org], thanks a lot :) I understand now that Word does not always store
this information.

I also wanted to know whether there is a way to sanitize the text that I am getting so that it
contains only text, with no image markers, bookmarks, or other data.
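
To make the request concrete, here is a minimal sketch of one possible post-processing step, assuming
the markers always look like the "[image: ...]" and "[bookmark: ...]" forms in the sample text quoted
below. The class MarkerStripper and its methods are hypothetical names for illustration and are not
part of Tika; the sketch only cleans the string after extraction and does not change how the parser
emits it.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not a Tika class): removes bracketed markers from already-extracted text. */
final class MarkerStripper {
    // Assumption: markers have the form "[image: ...]" or "[bookmark: ...]" with no nested brackets.
    private static final Pattern MARKER = Pattern.compile("\\[(?:image|bookmark):[^\\]]*\\]");
    // Runs of two or more spaces/tabs, which can be left behind once a marker is removed.
    private static final Pattern SPACES = Pattern.compile("[ \\t]{2,}");

    /** Replaces each marker with a space, collapses repeated spaces, and trims the result. */
    static String strip(String text) {
        String withoutMarkers = MARKER.matcher(text).replaceAll(" ");
        return SPACES.matcher(withoutMarkers).replaceAll(" ").trim();
    }

    /** Counts whitespace-separated tokens in the cleaned text (a rough word count). */
    static int countWords(String text) {
        String cleaned = strip(text);
        return cleaned.isEmpty() ? 0 : cleaned.split("\\s+").length;
    }
}
```

For example, MarkerStripper.strip("Translated document: [bookmark: _GoBack]As a translator") returns
"Translated document: As a translator". Each marker is replaced with a space rather than deleted
outright so that the words on either side of it are not glued together. The countWords method is only
a rough stand-in for the missing "words" value and will not necessarily match what Word itself reports.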

> Filter Documents Meta Data
> --------------------------
>
>                 Key: TIKA-2932
>                 URL: https://issues.apache.org/jira/browse/TIKA-2932
>             Project: Tika
>          Issue Type: Wish
>          Components: parser
>    Affects Versions: 1.22
>            Reporter: Thomas
>            Priority: Minor
>              Labels: newbie
>
> Hello!
> Is there a way to filter out tags like *[image: ]* and [bookmark] from the text
> I get while parsing the docs? I need it because sometimes the metadata does not return the number
> of words in a document if it contains images or tables.
> *MetaData*
> {"title":"Complete name,","description":null,"keywords":[],"language":"en","encoding":null,"author":"","generator":"Microsoft Office Word","pages":0,"words":0 ...
> *Text*
> [image: ] Certified Translation Certificate of Accuracy Your name here Translator/Interpreter
> Translated document: [bookmark: _GoBack]As a translator for Your Spanish Translation, Inc.,
> I, Your name here, declare that I am a bilingual translator who is thoroughly familiar with
> the English and source language languages. I have translated the attached document to the
> best of my knowledge from source language into English and the English text is an accurate
> and true translation of the original document presented to the best of my knowledge and belief.
> Signed on June 1, 201 Sign here in blue ink Your name here Professional Translator for Day
> Translations, Inc. [bookmark: _MailAutoSig]
> Please help!



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
