tika-dev mailing list archives

From "Michael McCandless (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (TIKA-1005) In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
Date Fri, 12 Oct 2012 12:23:04 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reassigned TIKA-1005:
----------------------------------------

    Assignee: Michael McCandless
    
> In Microsoft Office Word 2010 documents, text inside a textbox is not extracted/parsed out.
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1005
>                 URL: https://issues.apache.org/jira/browse/TIKA-1005
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.2
>         Environment: Windows 7, Windows Server 2008, Windows Server 2008 R2 (32bit and 64bit each)
>            Reporter: David A. Patterson
>            Assignee: Michael McCandless
>         Attachments: Textbox example.docx
>
>
> Text inside a textbox, which itself can be in the body, the header or the footer, is
> not extracted using any type of parser (including AutoDetectParser) in combination with any
> type of ContentHandler.  This is NOT a duplicate of TIKA-904.  This specifically concerns
> the .docx file format.
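
When triaging a report like this, it can help to confirm that the text box text is really present in the .docx package, independently of Tika. The following is a minimal sketch using only the Java standard library, not Tika's own extraction code. It assumes text box content is stored in `w:txbxContent` elements and that the body, header and footer parts use the conventional names `word/document.xml`, `word/headerN.xml` and `word/footerN.xml`. The class and method names (`TextBoxProbe`, `textBoxText`, `textBoxTextInDocx`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: lists the text found inside text boxes of a .docx file.
class TextBoxProbe {
    // Main WordprocessingML namespace (the "w:" prefix in document.xml).
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the concatenated w:t text of each w:txbxContent element in one XML part. */
    static List<String> textBoxText(InputStream xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(xml);
        List<String> result = new ArrayList<>();
        NodeList boxes = doc.getElementsByTagNameNS(W_NS, "txbxContent");
        for (int i = 0; i < boxes.getLength(); i++) {
            NodeList runs = ((Element) boxes.item(i)).getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < runs.getLength(); j++) {
                text.append(runs.item(j).getTextContent());
            }
            result.add(text.toString());
        }
        return result;
    }

    /** Scans the body, header and footer parts of a .docx (zip) stream. */
    static List<String> textBoxTextInDocx(InputStream docx) throws Exception {
        List<String> result = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                // Assumption: conventional part names; a real package may name parts differently.
                if (entry.getName().matches("word/(document|header\\d*|footer\\d*)\\.xml")) {
                    // Copy the entry into memory so the XML parser cannot close the zip stream.
                    result.addAll(textBoxText(new ByteArrayInputStream(zip.readAllBytes())));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            textBoxTextInDocx(in).forEach(System.out::println);
        }
    }
}
```

Running it against the attached example, for instance `java TextBoxProbe.java "Textbox example.docx"` on Java 11 or later, should print one line per text box it finds. The same text may be listed twice if the file stores both a primary and a fallback representation of a text box. If text shows up here but not in the parser output, the content is being lost during extraction rather than being absent from the file.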

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
