tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2052) Words are separated where there the letters are spaced together in the PDF document
Date Tue, 09 Aug 2016 14:24:20 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413601#comment-15413601 ]

Tim Allison commented on TIKA-2052:
-----------------------------------

Yes, this is a general problem with text extraction from PDFs.  Try the troubleshooting
recommendations on our [wiki|https://wiki.apache.org/tika/Troubleshooting%20Tika#PDF_Text_Problems].
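
To illustrate why this kind of split can happen: a PDF typically stores positioned
glyphs rather than explicit word boundaries, so an extractor has to guess where spaces
belong from the gaps between glyphs. The sketch below is a simplified stand-in for that
kind of heuristic, not Tika or PDFBox code. The glyph data, the function names, and the
`spacing_tolerance` parameter are all hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: hypothetical glyph data, not Tika or PDFBox code.
from typing import List, Tuple

Glyph = Tuple[str, float, float]  # (character, x position, width)


def join_glyphs(glyphs: List[Glyph], spacing_tolerance: float) -> str:
    """Rebuild text, inserting a space wherever the gap between glyphs is wide."""
    if not glyphs:
        return ""
    avg_width = sum(w for _, _, w in glyphs) / len(glyphs)
    out = [glyphs[0][0]]
    for (_, prev_x, prev_w), (ch, x, _) in zip(glyphs, glyphs[1:]):
        gap = x - (prev_x + prev_w)
        # Assumed rule: a gap wider than a fraction of the average glyph
        # width is treated as a word boundary.
        if gap > spacing_tolerance * avg_width:
            out.append(" ")
        out.append(ch)
    return "".join(out)


def layout(word: str, width: float = 5.0, wide_gap_after: int = -1,
           wide_gap: float = 2.0) -> List[Glyph]:
    """Place characters left to right, with one widened gap (for example
    from justification or unusual kerning) after the given index."""
    glyphs, x = [], 0.0
    for i, ch in enumerate(word):
        glyphs.append((ch, x, width))
        x += width
        if i == wide_gap_after:
            x += wide_gap
    return glyphs


# A slightly widened gap after the "a" in "Herzschrittmachers".
glyphs = layout("Herzschrittmachers", wide_gap_after=12)
low = join_glyphs(glyphs, spacing_tolerance=0.3)
high = join_glyphs(glyphs, spacing_tolerance=0.5)
print(low)   # Herzschrittma chers
print(high)  # Herzschrittmachers
```

In this toy model a looser tolerance keeps the word intact, but the same change can also
merge words that really are separate, so no single setting is right for every document.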

> Words are separated where there the letters are spaced together in the PDF document
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2052
>                 URL: https://issues.apache.org/jira/browse/TIKA-2052
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where "Herzschrittmacher"
> is separated into "Herzschrittma chers". This is especially problematic when using the PDF
> for full text search because often such end syllables are found which are not really part
> of the content. The whitespace config parameter did not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
