tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2833) Add a CSV/TSV detector
Date Tue, 26 Feb 2019 14:00:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16777934#comment-16777934 ]

Tim Allison commented on TIKA-2833:

The initial question is where to place this detector. It should be triggered only after all of
the other user-specified detectors _and_ after MimeTypes.

Some options:
1) Build it into MimeTypes and run it only when MimeTypes is about to return {{text/plain}}
-- I don't want to hardwire this into MimeTypes, though.
2) Run it in TXTParser before parsing the text. I don't like this because it bypasses the
usual detector configurability and hardwires detection into TXTParser.
3) Manually add it after adding MimeTypes in DefaultDetector.getDefaultDetectors() -- I like
this because users can turn it off via configuration, but it is smelly/hacky.
4) Create a separate class (LowPriorityDetector (ugh!)) or add a sorting parameter that
guarantees that the CSVDetector runs after MimeTypes.
5) Make CSVParser claim that it can parse {{text/plain}}, run its detection before the parse,
and, if it detects regular text rather than a CSV, back off to the TXTParser or replicate TXTParser's
behavior. This would allow users to turn off the CSVParser and its detection via the usual {{exclude}}
option on the CSVParser.
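
To make option 4 concrete, here is a rough sketch of what a sorting parameter could look like. It is illustration only: {{DetectorOrderSketch}}, {{Entry}}, {{priority}} and {{runOrder}} are hypothetical names, not part of Tika's API, and the priority values are arbitrary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of option 4; none of these names exist in Tika.
class DetectorOrderSketch {

    /** Stand-in for a configured detector: a name plus an assumed priority. */
    static final class Entry {
        final String name;
        final int priority; // lower values run earlier (assumption for this sketch)

        Entry(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    /** Returns detector names in the order they would be consulted. */
    static List<String> runOrder(List<Entry> configured) {
        List<Entry> sorted = new ArrayList<>(configured);
        // List.sort is stable, so entries with equal priority keep their configured order.
        sorted.sort(Comparator.comparingInt((Entry e) -> e.priority));
        List<String> names = new ArrayList<>();
        for (Entry e : sorted) {
            names.add(e.name);
        }
        return names;
    }
}
```

With this shape, giving the CSVDetector a higher number than MimeTypes would guarantee it runs afterwards, regardless of the order in which the detectors were loaded.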

Any recommendations/preferences?  I'm currently leaning toward option 5, but I suspect there may be
a more elegant answer.
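
For illustration, the back-off logic in option 5 might be sketched as below. This is a standalone approximation, not the proposed implementation: {{CsvBackoffSketch}}, {{looksDelimited}} and {{chooseHandler}} are hypothetical names, and the heuristic (every non-empty line has the same non-zero number of delimiters) is deliberately crude. It ignores quoted fields and charset detection entirely.

```java
// Hypothetical sketch of option 5; this does not use or mirror Tika's parser API.
class CsvBackoffSketch {

    /**
     * Crude check: at least two non-empty lines, each containing the same
     * non-zero number of delimiter characters. Quoted fields are not handled.
     */
    static boolean looksDelimited(String text, char delimiter) {
        int expected = -1;
        int counted = 0;
        for (String line : text.split("\\R")) {
            if (line.isEmpty()) {
                continue;
            }
            int n = 0;
            for (int i = 0; i < line.length(); i++) {
                if (line.charAt(i) == delimiter) {
                    n++;
                }
            }
            if (n == 0) {
                return false; // a line without any delimiter: treat as plain text
            }
            if (expected == -1) {
                expected = n;
            } else if (n != expected) {
                return false; // inconsistent column count
            }
            counted++;
        }
        return counted >= 2;
    }

    /** Decides which handler would take a text/plain stream: "csv", "tsv", or the "txt" fallback. */
    static String chooseHandler(String text) {
        if (looksDelimited(text, ',')) {
            return "csv";
        }
        if (looksDelimited(text, '\t')) {
            return "tsv";
        }
        return "txt"; // back off to plain-text handling
    }
}
```

The point of the sketch is the control flow: the CSV side claims {{text/plain}}, sniffs first, and hands anything that does not look delimited back to plain-text handling.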

> Add a CSV/TSV detector
> ----------------------
>                 Key: TIKA-2833
>                 URL: https://issues.apache.org/jira/browse/TIKA-2833
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
> Given initial experimentation, I think we can fairly easily add a fairly robust CSV/TSV
detector that will identify well-formed (ha!) csvs and return the charset encoding and the

This message was sent by Atlassian JIRA