tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (TIKA-1483) Create a Latin1 charset raw string parser
Date Thu, 26 Feb 2015 03:47:05 GMT

     [ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved TIKA-1483.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.8

OK, parser added in r1662349 and r1662350. Thanks [~lfcnassif]! Thanks [~gostep] for testing!

> Create a Latin1 charset raw string parser
> -----------------------------------------
>
>                 Key: TIKA-1483
>                 URL: https://issues.apache.org/jira/browse/TIKA-1483
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>            Assignee: Chris A. Mattmann
>             Fix For: 1.8
>
>         Attachments: TIKA-1483.patch, TIKA-1483_v2.patch
>
>
> I think it would be very useful to add a general parser that extracts raw strings from files (like the strings command), which could be used as the fallback parser for all mimetypes not having a specific parser implementation, like application/octet-stream. It could also be used as a fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files (currently I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?
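
For readers unfamiliar with the strings command, the core idea of the description above can be sketched in a few lines of Java. This is only an illustration and not the code from the attached patches or from r1662349/r1662350: the class name Latin1Strings, the isPrintable helper, and the minLength parameter are all hypothetical, and it assumes the whole input fits in a byte array. It scans for runs of bytes that map to printable ISO-8859-1 characters and keeps the runs that reach a minimum length.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the parser added to Tika.
final class Latin1Strings {

    // Assumption: "printable" means tab, ASCII 0x20-0x7E, or Latin1 0xA0-0xFF.
    static boolean isPrintable(int b) {
        return b == 0x09 || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    // Returns every run of printable bytes that is at least minLength bytes long.
    static List<String> extract(byte[] data, int minLength) {
        List<String> out = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if none
        // Iterate one step past the end so a run touching the end is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    out.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return out;
    }
}
```

Calling Latin1Strings.extract(bytes, 4) mirrors the default minimum run length of the strings command. Handling UTF-8 and UTF-16 runs in the same file, as the description suggests, would need additional heuristics that this sketch does not attempt.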



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
