tika-dev mailing list archives

From "Nick Burch (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2711) When parsing a UNIX text file apostrophes are rendered as ?
Date Thu, 06 Sep 2018 07:02:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605357#comment-16605357 ]

Nick Burch commented on TIKA-2711:

Text files do not include any encoding information, so Tika has to guess one before it can
process the file. The more text Tika has to work with, the more accurate that guess can be.

Can you try giving Tika a much longer set of French text in the two formats, and see if it
gets it right for both?

(IIRC we use the first few KB of text for the analysis. Very short runs of text are always
a problem for encoding and language detection, as there's not enough to go on to be sure which
of the many possibilities is correct.)
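
To illustrate why a guess is needed at all, here is a minimal Python sketch (not Tika code).
It assumes, purely for illustration, that the apostrophe in "l’alpha" was stored as the single
Windows-1252 byte 0x92; the attached files may well be encoded differently. The same bytes
decode without error under several encodings, each giving a different character, so a detector
needs enough surrounding text to rank the candidates.

```python
# Illustration only: assume "l’alpha" was saved as Windows-1252,
# where the typographic apostrophe (U+2019) is the single byte 0x92.
raw = "l\u2019alpha".encode("cp1252")  # b'l\x92alpha'

candidates = ["cp1252", "latin-1", "cp437", "utf-8"]
results = {}
for name in candidates:
    try:
        # Single-byte encodings usually accept any byte, so this "succeeds"
        # even when the chosen encoding is the wrong one.
        results[name] = raw.decode(name)
    except UnicodeDecodeError:
        # The byte sequence is not valid in this encoding at all.
        results[name] = None

for name, decoded in results.items():
    print(f"{name:8} -> {decoded!r}")
```

Only one of the successful decodings contains the intended apostrophe, and nothing in the bytes
themselves says which one that is.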

Alternatively, if you know for sure which text encoding was used, you can tell Tika, and that
will help a lot!
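
As a rough sketch of why a known encoding helps (plain Python, not the Tika API), the snippet
below assumes the text is stored as UTF-8. Decoding with the known encoding round-trips exactly,
a wrong guess silently produces mojibake, and converting to a charset that lacks the typographic
apostrophe substitutes a question mark. That last step is only one possible way an apostrophe
can end up as "?"; it is not a diagnosis of this particular issue.

```python
# Illustration only: assume the file content is UTF-8.
text = "d\u2019\u00eatre"  # d’être, with a typographic apostrophe
utf8_bytes = text.encode("utf-8")

# Decoding with the known encoding recovers the text exactly.
known = utf8_bytes.decode("utf-8")

# A wrong guess does not fail; it quietly yields different characters.
wrong = utf8_bytes.decode("cp1252")

# ISO-8859-1 has no U+2019, so a lossy conversion replaces it with "?".
lossy = text.encode("latin-1", errors="replace")

print(repr(known))
print(repr(wrong))
print(lossy)
```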

> When parsing a UNIX text file apostrophes are rendered as ?
> -----------------------------------------------------------
>                 Key: TIKA-2711
>                 URL: https://issues.apache.org/jira/browse/TIKA-2711
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.18
>         Environment: Windows 10
>            Reporter: Ichbiah
>            Priority: Minor
>             Fix For: 1.19
>         Attachments: petit_dos.txt, petit_unix.txt
>   Original Estimate: 12h
>  Remaining Estimate: 12h
> I have a small text file in two versions:
>  * a dos version of the file
>  * a unix version of the file
> Both contain the same text below:
> La politique macroéconomique cesse officiellement d’être 
> l’alpha et l’oméga de la lutte contre le chômage.
> When I parse them using the tika-app.jar, the text is correctly "extracted" from the
> DOS version of the file. For the UNIX version of the file the apostrophes are falsely rendered
> as question marks.

This message was sent by Atlassian JIRA
