tika-dev mailing list archives

From "Dennis Adler (JIRA)" <j...@apache.org>
Subject [jira] Updated: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
Date Fri, 14 Jan 2011 01:14:45 GMT

     [ https://issues.apache.org/jira/browse/TIKA-583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Adler updated TIKA-583:
------------------------------

    Attachment: Savchuk v. Jerde.pdf

Original PDF, parsed with tika-app-0.7 and tika-app-0.8 (release). The sample text in the bug
report was copied from the "Plain text" tabs. I found this file on the web, so it should be
fine for ASF inclusion.

> Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
> ------------------------------------------------------------------------------------------
>
>                 Key: TIKA-583
>                 URL: https://issues.apache.org/jira/browse/TIKA-583
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.8
>         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
>            Reporter: Dennis Adler
>         Attachments: Savchuk v. Jerde.pdf
>
>
> When parsed by Tika 0.7, the attached PDF (a legal filing from the web) yields the
> following as its first several lines of plain text:
> ------- start ---------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
> DIVISION ONE
>   SERGEY SAVCHUK, )
>  ) No. 64269-3-I
>  Appellant, )
>  v. )
>  ) UNPUBLISHED OPINION
>  STEVEN G. JERDE and )
>  DARLYCE J. JERDE, husband and wife )
> )
>  Respondents. )
>  _______________________________  ) FILED: November 1, 2010
> --------------- end ---------------------
> Tika 0.8 has this instead:
> -------------- start ---------------------
> IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED
OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________
 )FILED: November 1, 2010schindler, j
> --------------- end ---------------------
> Notice that, as part of the improved paragraph breaking for PDF files, the lines of the
> document "header" were concatenated without spaces in between, creating run-on words
> (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Compare the original PDF with the
> extracted text for more details.
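
The run-on words described above are what one would expect if extracted lines are appended with no separator. The following is a minimal sketch of that failure mode and one possible fix. It is not Tika's or PDFBox's actual code; the class and method names (LineJoinSketch, joinNaive, joinWithSpace) are hypothetical and exist only for illustration. It covers only the joining of lines, not the spaces that 0.8 also drops within a line (e.g. "SERGEYSAVCHUK").

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not taken from Tika or PDFBox.
public class LineJoinSketch {

    // Appends each trimmed line directly to the previous one.
    // With no separator, the last word of one line fuses with
    // the first word of the next.
    public static String joinNaive(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line.trim());
        }
        return sb.toString();
    }

    // Same join, but inserts a single space between non-empty lines
    // so that word boundaries survive the removal of line breaks.
    public static String joinWithSpace(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines instead of adding extra spaces
            }
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(trimmed);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First lines of the Tika 0.7 output quoted in this report.
        List<String> lines = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "  SERGEY SAVCHUK, )");
        System.out.println(joinNaive(lines));
        System.out.println(joinWithSpace(lines));
    }
}
```

Running main prints the naive join first, which contains "WASHINGTONDIVISION", followed by the space-separated join, which keeps the two words apart.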

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

