tika-dev mailing list archives

From "Daniel Gibby (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (TIKA-1130) .docx text extract leaves out some portions of text
Date Wed, 10 Jul 2013 16:01:51 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Gibby reopened TIKA-1130:
--------------------------------


I found some files that still exhibit the problem: not all of the text is extracted. If the
underlying cause is still the same POI issue, should these cases all be handled in this
ticket, or should a new ticket be opened?
                
> .docx text extract leaves out some portions of text
> ---------------------------------------------------
>
>                 Key: TIKA-1130
>                 URL: https://issues.apache.org/jira/browse/TIKA-1130
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2, 1.3
>         Environment: OpenJDK x86_64
>            Reporter: Daniel Gibby
>            Priority: Critical
>             Fix For: 1.5
>
>         Attachments: OwenResume.docx, Resume 6.4.13.docx, tee internal resme.docx, TIKA-1130.patch, TIKA-1130.patch
>
>
> When parsing a Microsoft Word .docx (application/vnd.openxmlformats-officedocument.wordprocessingml.document), certain portions of text remain unextracted.
> I have attached a .docx file that can be tested against. The 'gray' portions of text are not extracted, while the darker colored text extracts fine.
> Looking at the document.xml portion of the .docx zip file shows the text is all there.
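
The check described above (opening the .docx as a zip and looking at document.xml) can be
sketched roughly as follows. This is not Tika or POI code. The function names
docx_text_runs and missing_runs are hypothetical, and the sketch assumes the text of
interest sits in w:t elements of word/document.xml; headers, footers and footnotes live in
other parts of the package and are not covered.

```python
import zipfile
import xml.etree.ElementTree as ET

# Main WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_runs(source):
    """Return the text of every w:t element in word/document.xml, in document order.

    `source` is a path or a seekable file-like object holding the .docx.
    Nested content (for example text boxes) is included because iter()
    walks the whole tree.
    """
    with zipfile.ZipFile(source) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    return [t.text for t in root.iter(f"{{{W_NS}}}t") if t.text]


def missing_runs(source, extracted_text):
    """List the non-blank runs from the .docx that do not appear in extracted_text.

    Assumption: the extractor preserves each run's characters verbatim, so a
    plain substring test is enough. An extractor that normalizes whitespace
    would need a looser comparison.
    """
    return [
        run
        for run in docx_text_runs(source)
        if run.strip() and run not in extracted_text
    ]
```

Passing one of the attached files together with the text produced by the extractor to
missing_runs would list the runs that are present in document.xml but absent from the
output, which may help narrow down which XML structures are being skipped.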

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
