tika-dev mailing list archives

From "Sebastian Landwehr (JIRA)" <j...@apache.org>
Subject [jira] [Closed] (TIKA-2052) Words are separated where there the letters are spaced together in the PDF document
Date Tue, 09 Aug 2016 15:09:20 GMT

     [ https://issues.apache.org/jira/browse/TIKA-2052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Landwehr closed TIKA-2052.
------------------------------------
    Resolution: Not A Problem

PDFBox issue ...

> Words are separated where there the letters are spaced together in the PDF document
> -----------------------------------------------------------------------------------
>
>                 Key: TIKA-2052
>                 URL: https://issues.apache.org/jira/browse/TIKA-2052
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Sebastian Landwehr
>
> For example in the following document:
> https://www.g-ba.de/downloads/39-261-2062/2014-08-21_QSKH-RL_Q-Report_2013.pdf
> Searching for "onsimpulse des Herzschrittmachers" finds the location where "Herzschrittmacher"
> is separated into "Herzschrittma chers". This is especially problematic when the PDF is used
> for full-text search, because such end syllables are often found as separate words even though
> they are not really part of the content. The whitespace config parameter did not help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
