tika-dev mailing list archives

From "Chris A. Mattmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1483) Create a general raw string parser
Date Wed, 25 Feb 2015 07:36:05 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336150#comment-14336150 ]

Chris A. Mattmann commented on TIKA-1483:
-----------------------------------------

That fixed it, [~gostep]! Thanks, all tests are passing for me now. +1 to commit this. If there
are no objections in the next 24 hours, I'll commit it.

{noformat}
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] Apache Tika parent ................................. SUCCESS [  1.821 s]
[INFO] Apache Tika core ................................... SUCCESS [ 21.645 s]
[INFO] Apache Tika parsers ................................ SUCCESS [02:06 min]
[INFO] Apache Tika XMP .................................... SUCCESS [  2.072 s]
[INFO] Apache Tika serialization .......................... SUCCESS [  2.382 s]
[INFO] Apache Tika application ............................ SUCCESS [ 14.697 s]
[INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 17.896 s]
[INFO] Apache Tika server ................................. SUCCESS [ 21.473 s]
[INFO] Apache Tika translate .............................. SUCCESS [  2.746 s]
[INFO] Apache Tika examples ............................... SUCCESS [  5.429 s]
[INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.680 s]
[INFO] Apache Tika ........................................ SUCCESS [  0.038 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:39 min
[INFO] Finished at: 2015-02-24T23:32:12-08:00
[INFO] Final Memory: 100M/1653M
[INFO] ------------------------------------------------------------------------
{noformat}

> Create a general raw string parser
> ----------------------------------
>
>                 Key: TIKA-1483
>                 URL: https://issues.apache.org/jira/browse/TIKA-1483
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>         Attachments: TIKA-1483.patch, TIKA-1483_v2.patch
>
>
> I think it would be very useful to add a general parser able to extract raw strings from
> files (like the strings command), which could be used as the fallback parser for all MIME
> types that have no specific parser implementation, such as application/octet-stream. It
> could also be used as a fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files (currently
> I have implemented one specific to Latin-1).
> It can use heuristics to extract strings encoded with different charsets within the same
> file, mainly the common ISO-8859-1, UTF-8, and UTF-16.
> What does the community think about that?
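
For readers unfamiliar with the strings command mentioned in the description, the basic idea can be sketched as below. This is only a rough illustration and is not the code in the attached TIKA-1483 patches: the class name RawStringsSketch, the method names, and the minLength parameter are all hypothetical, and the sketch handles only single-byte ISO-8859-1 text, with none of the multi-charset heuristics the description proposes.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not taken from the TIKA-1483 patches.
public class RawStringsSketch {

    // Assumption: "printable" means tab, printable ASCII, or the printable
    // upper half of ISO-8859-1 (0xA0-0xFF). A real parser may choose differently.
    static boolean isPrintableLatin1(int b) {
        return b == '\t' || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    // Collects every run of at least minLength printable bytes, decoded as ISO-8859-1.
    static List<String> extractStrings(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 when outside a run
        // Iterate one step past the end so a run touching the last byte is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintableLatin1(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    result.add(new String(data, start, i - start, StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] sample = {0x00, 0x01, 'T', 'i', 'k', 'a', 0x00, 'o', 'k', 0x02,
                         'p', 'a', 'r', 's', 'e', 'r', (byte) 0xE9};
        // The two-byte run "ok" is shorter than minLength and is dropped.
        System.out.println(extractStrings(sample, 4));
    }
}
```

A minimum run length (4 here, the usual default of the strings command) is what keeps random binary noise from being reported as text. Extending the idea to UTF-8 and UTF-16 within the same file would need the kind of heuristics the description mentions.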



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
