tika-dev mailing list archives

From "Dennis Adler (JIRA)" <j...@apache.org>
Subject [jira] Created: (TIKA-583) Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
Date Fri, 14 Jan 2011 01:12:45 GMT
Tika 0.8 line break removal is faulty (misses space when concatenating lines) for PDF file
------------------------------------------------------------------------------------------

                 Key: TIKA-583
                 URL: https://issues.apache.org/jira/browse/TIKA-583
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 0.8
         Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4 
            Reporter: Dennis Adler


The included PDF (a legal filing from the web), when parsed by Tika 0.7, yields the following
as its first several lines of plain text:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
  SERGEY SAVCHUK, )
 ) No. 64269-3-I
 Appellant, )
 v. )
 ) UNPUBLISHED OPINION
 STEVEN G. JERDE and )
 DARLYCE J. JERDE, husband and wife )
)
 Respondents. )
 _______________________________  ) FILED: November 1, 2010
--------------- end ---------------------

Tika 0.8 has this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED
OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________
 )FILED: November 1, 2010schindler, j
--------------- end ---------------------

Notice that, as part of the improved paragraph breaking for PDF files, the lines of the
document "header" were concatenated without spaces in between, creating run-on words (e.g.
"WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Compare the original PDF with the extracted
text for more details.
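
To illustrate the expected behavior, here is a minimal sketch of joining extracted lines with a
single separating space. It is not the Tika or PDFBox implementation; the class LineJoiner and
the method joinLines are hypothetical names used only for this example, and the input lines are
copied from the Tika 0.7 output above.

```java
import java.util.Arrays;
import java.util.List;

public class LineJoiner {

    // Hypothetical helper (not part of Tika or PDFBox): joins lines into one
    // paragraph, inserting exactly one space between consecutive non-blank lines.
    public static String joinLines(List<String> lines) {
        StringBuilder out = new StringBuilder();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.length() == 0) {
                continue; // skip blank lines entirely
            }
            if (out.length() > 0) {
                out.append(' '); // the separator that the 0.8 output is missing
            }
            out.append(trimmed);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList(
                "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON",
                "DIVISION ONE",
                "  SERGEY SAVCHUK, )");
        System.out.println(joinLines(header));
    }
}
```

With a separator like this, the first lines would read "... WASHINGTON DIVISION ONE SERGEY
SAVCHUK, )" rather than "WASHINGTONDIVISION ONESERGEYSAVCHUK".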


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

