tika-dev mailing list archives

From "Luis Filipe Nassif (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-1541) StringsParser: a simple strings-based parser for Tika
Date Sat, 07 Feb 2015 19:56:34 GMT

    [ https://issues.apache.org/jira/browse/TIKA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310899#comment-14310899 ]

Luis Filipe Nassif commented on TIKA-1541:

Hi Chris, I definitely agree Giuseppe's patch is a great start!

However, note that in TIKA-1483 I mentioned that I have a specific implementation for extracting
text in Latin1 scripts encoded with the ISO8859-1, UTF8 and UTF16 charsets at the same time
(less general than what is proposed in the issue), and I asked whether it would be of interest.
If the community still thinks it would be useful, I will submit a patch.

A possible improvement to Giuseppe's patch is to let the user configure the encoding parameter
of the Unix strings command. It is not hard to write and is a powerful configuration option.
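
To illustrate the idea, here is a minimal sketch of how a configurable encoding could be turned
into a strings command line. It assumes GNU strings, whose -e option accepts s, S, b, l, B or L
and whose -n option sets the minimum string length. The class and method names are hypothetical
and are not taken from Giuseppe's patch or from Tika.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of the TIKA-1541 patch.
public class StringsCommandBuilder {

    // Values accepted by the -e option of GNU strings:
    // s = 7-bit, S = 8-bit, b/l = 16-bit big/little endian, B/L = 32-bit big/little endian.
    private static final List<String> GNU_ENCODINGS =
            Arrays.asList("s", "S", "b", "l", "B", "L");

    // Builds the argument list for a strings invocation with a user-chosen encoding.
    public static List<String> build(String encoding, int minLength, String filePath) {
        if (!GNU_ENCODINGS.contains(encoding)) {
            throw new IllegalArgumentException("Unsupported strings encoding: " + encoding);
        }
        if (minLength < 1) {
            throw new IllegalArgumentException("Minimum length must be at least 1");
        }
        List<String> command = new ArrayList<>();
        command.add("strings");
        command.add("-e");
        command.add(encoding);
        command.add("-n");
        command.add(Integer.toString(minLength));
        command.add(filePath);
        return command;
    }

    public static void main(String[] args) {
        // 16-bit little-endian strings of at least 4 characters from a sample file.
        System.out.println(build("l", 4, "sample.bin"));
    }
}
```

The returned list can be passed directly to a ProcessBuilder. Validating the encoding up front
keeps an invalid user setting from reaching the external command.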

I agree that it should not be enabled by default for octet-stream, just as I previously suggested
not enabling TesseractOCRParser by default: both can add a lot of time to parsing and surprise
users, as Nick pointed out.

> StringsParser: a simple strings-based parser for Tika
> -----------------------------------------------------
>                 Key: TIKA-1541
>                 URL: https://issues.apache.org/jira/browse/TIKA-1541
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Giuseppe Totaro
>            Assignee: Chris A. Mattmann
>         Attachments: TIKA-1541.TotaroMattmann.020615.patch.txt, TIKA-1541.TotaroMattmann.020615.patch.txt,
> I have put together an extremely simple implementation of {{StringsParser}}, a parser based on the {{strings}} command (or a {{strings}}-alternative command), to be used instead of the dummy {{EmptyParser}} for undetected files. It is preliminary work (you will see a lot of TODOs). It is inspired by the work on {{TesseractOCRParser}}. The patch is attached.
> I created a GitHub [repository|https://github.com/giuseppetotaro/StringsParser] for sharing the code. As a first test, you can clone the repo, build the code using the {{build.sh}} script, and then run the parser using the {{run.sh}} script on some [govdocs1|http://digitalcorpora.org/corpora/govdocs] files (taken from the "016" subset) detected as {{application/octet-stream}}. The latter script launches a simple {{StringsTest}} class for testing.
> I hope you will find the {{StringsParser}} a good solution for extracting ASCII strings from undetected filetypes. As far as I understand, many "sophisticated" forensics tools work in a similar manner for indexing purposes: they run some form of the {{strings}} command against files that they are not able to detect.
> In addition to running {{strings}} on undetected files, the {{StringsParser}} launches the {{file}} command on undetected files and then writes the output to the {{strings:file_output}} property (I noticed that the {{file}} command is sometimes able to detect the media type of documents not detected by Tika).
> Finally, you can find an old discussion about this topic [here|http://lucene.472066.n3.nabble.com/Default-MIME-Type-td645215.html]. Thanks [~chrismattmann].
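
The file step described in the issue could be sketched roughly as follows. This is only an
illustration under stated assumptions: it uses a plain Map in place of Tika's Metadata class,
it assumes a file command that supports the -b (brief) option, and the class and method names
are hypothetical. Only the strings:file_output property name comes from the issue description.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch for illustration only; the code in the patch may differ.
public class FileCommandSketch {

    // Property name mentioned in the issue description.
    public static final String FILE_OUTPUT_KEY = "strings:file_output";

    // Runs an external command and returns the first line it prints (trimmed), or "" if none.
    public static String firstLineOf(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout
        Process process = builder.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String first = reader.readLine();
            while (reader.readLine() != null) {
                // drain remaining output so the process can exit
            }
            process.waitFor();
            return first == null ? "" : first.trim();
        }
    }

    // Stores the output of the file command under the strings:file_output key.
    public static void recordFileOutput(Map<String, String> metadata, String fileOutput) {
        metadata.put(FILE_OUTPUT_KEY, fileOutput.trim());
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: FileCommandSketch <path>");
            return;
        }
        Map<String, String> metadata = new HashMap<>();
        // "file -b" prints the description without the leading file name.
        recordFileOutput(metadata, firstLineOf(Arrays.asList("file", "-b", args[0])));
        System.out.println(metadata);
    }
}
```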

This message was sent by Atlassian JIRA
