tika-dev mailing list archives

From "Shai Erera (JIRA)" <j...@apache.org>
Subject [jira] [Created] (TIKA-1131) Output sentence-break "hints" for files such as PPT/X
Date Thu, 06 Jun 2013 21:00:21 GMT
Shai Erera created TIKA-1131:
--------------------------------

             Summary: Output sentence-break "hints" for files such as PPT/X
                 Key: TIKA-1131
                 URL: https://issues.apache.org/jira/browse/TIKA-1131
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Shai Erera
            Priority: Minor


Spinoff from this thread: http://tika.markmail.org/thread/xk5sclapbeonifzr. I believe the text
in these files usually does not end with the usual sentence-break punctuation. As I showed in
that email, the parser seems to separate items such as bullets by inserting '\n' characters,
but that is not enough under the sentence segmentation rules of UAX#29.

It would be better if the parser output a clearer marker which the user could then replace
with a true sentence break (e.g. \u2029), rather than arbitrarily replacing every '\n', which
I think is not a good general solution.
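
To illustrate the idea, here is a minimal sketch of what a consumer could do if such a marker existed. The marker itself (the private-use character U+E000) and the class and method names are hypothetical; Tika does not define any of them. The point is only that a dedicated marker can be mapped to \u2029 while ordinary '\n' characters inside an item are left alone.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceHintSketch {
    // Hypothetical hint marker (a private-use code point); not something Tika emits.
    static final char HINT = '\uE000';
    // U+2029 PARAGRAPH SEPARATOR, the "true" break mentioned above.
    static final char PARAGRAPH_SEPARATOR = '\u2029';

    // Replace only the hint marker; plain newlines are kept as they are.
    static String toParagraphSeparators(String extracted) {
        return extracted.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    // Split the converted text into units, dropping empty pieces.
    static List<String> splitOnParagraphSeparators(String text) {
        List<String> parts = new ArrayList<>();
        for (String part : text.split(String.valueOf(PARAGRAPH_SEPARATOR))) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }
}
```

With this, a string such as "Bullet one" + HINT + "Line A\nLine B" would be split into two units, and the '\n' inside the second unit would survive.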

BTW, I parsed Impress files and it seems the parser does output some hints (I think <p>
tags).
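
Where the parser does emit <p> elements, a consumer could probably treat the end of each one as a break hint. The following is only a sketch under that assumption: ParagraphBreakHandler is a hypothetical name, and the handler is driven here by the JDK's own SAX parser on a hand-written snippet rather than by Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class ParagraphBreakHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Check both names so the handler works with and without namespace awareness.
        if ("p".equals(localName) || "p".equals(qName)) {
            text.append('\u2029'); // paragraph separator after every </p>
        }
    }

    String text() {
        return text.toString();
    }

    // Convenience method for trying the handler on a small, well-formed snippet.
    static String extract(String xhtml) {
        ParagraphBreakHandler handler = new ParagraphBreakHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse snippet", e);
        }
        return handler.text();
    }
}
```

Since the handler is a plain SAX ContentHandler, it could in principle be passed to a parser that reports XHTML-style events, though that has not been checked against the PPT/X output discussed here.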

I'll upload an isolated test that reproduces the output I included in the email.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
