tika-dev mailing list archives

From "Franz Canaval (Resolved) (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-796) Tika breaks words of rotated text in PDF documents
Date Fri, 20 Jan 2012 08:54:40 GMT

     [ https://issues.apache.org/jira/browse/TIKA-796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Franz Canaval resolved TIKA-796.
--------------------------------

    Resolution: Duplicate

Duplicate of https://issues.apache.org/jira/browse/TIKA-723
                
> Tika breaks words of rotated text in PDF documents
> --------------------------------------------------
>
>                 Key: TIKA-796
>                 URL: https://issues.apache.org/jira/browse/TIKA-796
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10, 1.0
>         Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
>            Reporter: Franz Canaval
>              Labels: broken, linefeed, pdf, rotated, text, words
>
> When Tika extracts text from a PDF file, *rotated text is extracted with its words broken apart.* The number of lines in a rotated paragraph appears to equal the number of characters after which Tika splits the words with a line feed (0x0a) character.
> Steps to reproduce this issue (in this example, on a Windows machine):
> * Download the following PDF file: [http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf], e.g. to C:\temp\ (the command in the next step expects it to be named energieberatung.pdf)
> * Open a console window and run Tika with: {{java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt}}
> * Look at the resulting text file, e.g. with a hex editor, and note that the words are broken into two-character pieces: {{<char1><char2><LF>}}
> *This problem seems to have been introduced in Tika 0.10; it still exists in Tika 1.0.*
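
As an alternative to inspecting the output with a hex editor, the check from the last reproduction step could be automated. The following is only a minimal sketch, not part of the original report: the class name BrokenWordCheck and the method longestShortLineRun are hypothetical, the file name test.txt is taken from the command above, and the sketch assumes the output can be read as UTF-8 (redirected console output on Windows may use a different encoding). It counts the longest run of consecutive non-empty lines that contain at most a given number of characters; a long run of two-character lines would be consistent with the symptom described.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika: looks for the <char1><char2><LF> pattern.
class BrokenWordCheck {

    // Returns the length of the longest run of consecutive non-empty lines
    // that are each at most maxLen characters long (after trimming whitespace,
    // which also removes a trailing carriage return).
    static int longestShortLineRun(String text, int maxLen) {
        int longest = 0;
        int current = 0;
        for (String line : text.split("\n", -1)) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && trimmed.length() <= maxLen) {
                current++;
                longest = Math.max(longest, current);
            } else {
                // An empty or longer line ends the current run.
                current = 0;
            }
        }
        return longest;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to test.txt, the output file name used in the steps above.
        String path = args.length > 0 ? args[0] : "test.txt";
        // Assumption: the file is UTF-8; undecodable bytes are replaced, not rejected.
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        System.out.println("Longest run of lines with at most 2 characters: "
                + longestShortLineRun(text, 2));
    }
}
```

Ordinary extracted text should yield a small number here, while output affected by the word breaking should yield a run roughly as long as the rotated text divided into two-character pieces. The threshold of 2 matches the example in the report; for other documents it would have to be adjusted to the number of lines in the rotated paragraph.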

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira