tika-dev mailing list archives

From "Ichbiah (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2711) When parsing a UNIX text file apostrophes are rendered as ?
Date Thu, 06 Sep 2018 09:45:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605548#comment-16605548 ]

Ichbiah commented on TIKA-2711:

I can provide a longer text, but the problem is the same. I wonder why Tika is inconsistent:
the text is identical, yet the Tika output differs depending on whether the file is DOS or
UNIX. It produces the right text for the DOS file but not for the UNIX file, where the
apostrophes are not rendered correctly.

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>         Attachments: long_text_dos.txt, long_text_unix.txt, petit_dos.txt, petit_unix.txt
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> I have a small text file in two versions:
>  * a DOS version of the file
>  * a UNIX version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using the tika-app.jar, the text is correctly "extracted" from the
> DOS version of the file. For the UNIX version of the file the apostrophes are incorrectly
> rendered as question marks.
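
The symptom described above (accented letters survive, only the apostrophes turn into "?") is what Java produces when text containing U+2019 (’) is encoded to a charset that has no such character, for example ISO-8859-1. Whether this is what actually happens inside Tika for the UNIX file is an assumption and is not confirmed anywhere in this thread. The sketch below uses only the JDK; the class name ApostropheDemo and the helper roundTrip are hypothetical names made up for illustration and are not part of Tika.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ApostropheDemo {

    // Encode the text to bytes in the given charset, then decode those bytes
    // with the same charset. Characters the charset cannot represent are
    // replaced by String.getBytes with the charset's replacement byte.
    public static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // "d’être" written with escapes so the source file encoding does not matter.
        String sample = "d\u2019\u00EAtre";

        String viaUtf8 = roundTrip(sample, StandardCharsets.UTF_8);
        String viaLatin1 = roundTrip(sample, StandardCharsets.ISO_8859_1);

        // UTF-8 can represent every character, so the text is unchanged.
        System.out.println("UTF-8 round trip unchanged: " + viaUtf8.equals(sample));

        // ISO-8859-1 has 'ê' but not U+2019, so only the apostrophe becomes '?'.
        System.out.println("ISO-8859-1 turns apostrophe into '?': "
                + viaLatin1.equals("d?\u00EAtre"));
    }
}
```

If the two attached files differ in more than their line endings, for instance if one is stored as UTF-8 and the other as windows-1252, comparing their raw bytes around the apostrophe would help narrow this down before looking at Tika itself.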

