tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
Date Sat, 07 Feb 2015 20:07:35 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310905#comment-14310905
] 

Luis Filipe Nassif commented on TIKA-1541:
------------------------------------------

Another suggestion: I think the parser should not set the contentType to
application/octet-stream. That way it can also be used for known types that have no
specific parser, and for corrupted files whose parsers threw an exception.
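
To make the suggestion concrete, here is a minimal sketch of the fallback idea in plain Java. It does not use Tika's API: the SimpleParser interface, the FallbackSketch class and the plain Map used as metadata are all hypothetical stand-ins, and the "Content-Type" key is only an assumed name for wherever the detected type is stored.

```java
import java.util.Map;

final class FallbackSketch {
    private FallbackSketch() {}

    /** Hypothetical stand-in for a parser; this is NOT Tika's Parser interface. */
    interface SimpleParser {
        String parse(byte[] data, Map<String, String> metadata) throws Exception;
    }

    /**
     * Tries the type-specific parser first. If it throws, the fallback
     * (for example a strings-based extractor) is used instead, and the
     * content type detected beforehand is left as it was.
     */
    static String parseWithFallback(SimpleParser primary, SimpleParser fallback,
                                    byte[] data, Map<String, String> metadata)
            throws Exception {
        // Assumed key name; remember what detection reported before parsing.
        String detectedType = metadata.get("Content-Type");
        try {
            return primary.parse(data, metadata);
        } catch (Exception e) {
            String text = fallback.parse(data, metadata);
            // Undo any relabeling so a corrupted PDF is still reported as a PDF.
            if (detectedType != null) {
                metadata.put("Content-Type", detectedType);
            }
            return text;
        }
    }
}
```

If the fallback parser never touches the content type in the first place, the restore step becomes unnecessary, which is the behaviour suggested above.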

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.TotaroMattmann.020615.patch.txt,
> TIKA-1541.patch
>
>
> I have written an extremely simple {{StringsParser}}, a parser based on the {{strings}}
> command (or an alternative to {{strings}}), to use instead of the dummy {{EmptyParser}}
> for undetected files. It is preliminary work (you will see a lot of TODOs) and is inspired
> by the work on {{TesseractOCRParser}}. The patch is attached.
> I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] to share
> the code. As a first test, you can clone the repo, build the code with the {{build.sh}} script,
> and then run the parser with the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs]
> files (taken from the "016" subset) that are detected as {{application/octet-stream}}. The
> latter script launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings
> from undetected file types. As far as I understand, many "sophisticated" forensics tools work
> in a similar manner for indexing purposes: they run some kind of {{strings}} command against
> files whose type they cannot detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}} runs the
> {{file}} command on them and writes its output to the {{strings:file_output}} property
> (I noticed that the {{file}} command can sometimes detect the media type of documents that
> Tika does not detect).
> Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html].
> Thanks [~chrismattmann].
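
For readers unfamiliar with what a {{strings}}-style extraction does, the following is a rough pure-Java sketch of the technique. It is not the code from the attached patch, which (per the description above) calls the external {{strings}} command; the AsciiStrings class and its extract method are hypothetical names, and the printable range used here (0x20 to 0x7e) only approximates what the real tool accepts.

```java
import java.util.ArrayList;
import java.util.List;

final class AsciiStrings {
    private AsciiStrings() {}

    /**
     * Returns every run of at least minLength consecutive printable
     * ASCII bytes (0x20..0x7e) found in data, in order of appearance.
     */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            // Java bytes are signed, so values above 0x7f are negative
            // and fail the lower bound check.
            if (b >= 0x20 && b <= 0x7e) {
                run.append((char) b);
            } else {
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        return found;
    }
}
```

A minimum length of 4 matches the usual default of the {{strings}} command; shorter thresholds tend to produce mostly noise on binary input.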



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
