tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2711) When parsing a UNIX text file apostrophes are rendered as ?
Date Thu, 06 Sep 2018 11:27:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605660#comment-16605660 ]

Nick Burch commented on TIKA-2711:

On the long text, Tika is detecting the DOS version as {{windows-1252}} and the Unix
one as {{ISO-8859-1}}, which are similar but not the same. Most likely it is the {{\r\n}}
vs {{\n}} line endings that tip the detection one way or the other, as the detection is
probabilistic and n-gram based.
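
To illustrate why that small difference matters, here is a rough Python sketch (not Tika code). It assumes the curly apostrophe was saved as the single windows-1252 byte 0x92, and that the output is finally written in a charset with no mapping for the C1 control character U+0092. Both are assumptions about the reporter's setup, and the name {{render}} is made up for this example.

```python
# Sample phrase from the report, written with escapes so the source stays ASCII.
TEXT = "l\u2019alpha et l\u2019om\u00e9ga"

# Assumption: the file was saved as windows-1252, so U+2019 becomes byte 0x92.
RAW = TEXT.encode("windows-1252")


def render(raw: bytes, detected: str, output: str = "windows-1252") -> str:
    """Decode with the detected charset, then round-trip through the output
    charset, replacing anything that cannot be encoded with '?'."""
    decoded = raw.decode(detected)
    return decoded.encode(output, errors="replace").decode(output)


# Detected as windows-1252: byte 0x92 maps back to the curly apostrophe.
good = render(RAW, "windows-1252")

# Detected as ISO-8859-1: byte 0x92 becomes the control character U+0092,
# which the output charset cannot encode, so it is replaced by '?'.
bad = render(RAW, "ISO-8859-1")

print(ascii(good))
print(ascii(bad))
```

The accented letter survives in both cases because the two encodings agree on byte 0xE9; only the bytes in the 0x80-0x9F range are interpreted differently.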

If you know what encoding you used, tell Tika that! It'll then be fine.

Or get your users to use proper normal apostrophes not the dodgy windows special ones ;)

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>         Attachments: long_text_dos.txt, long_text_unix.txt, petit_dos.txt, petit_unix.txt
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> I have a small text file in two versions:
>  * a dos version of the file
>  * a unix version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using the tika-app.jar, the text is correctly "extracted" from the
> DOS version of the file. For the UNIX version of the file the apostrophes are falsely rendered
> as question marks.

This message was sent by Atlassian JIRA
