tika-dev mailing list archives

From "Tim Allison (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (TIKA-2833) Add a CSV/TSV detector
Date Thu, 28 Feb 2019 19:35:00 GMT

    [ https://issues.apache.org/jira/browse/TIKA-2833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780854#comment-16780854 ]

Tim Allison commented on TIKA-2833:
-----------------------------------

The real test will be to run this against the full corpus and see how many false positives
we get, i.e. files identified as CSV that are actually plain text.

In addition to adding a first-pass (horrifically heuristic) detector, I also added a backoff:
if there is a parse exception, whatever is left in the Reader is treated as plain text.
We could customize the Reader (wrap it in something) to capture the content that was already
buffered in the org.apache.commons.csv.CSVParser when the exception was hit.
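
The backoff and the "wrap the Reader" idea could look roughly like the sketch below. It is
not the actual Tika change. Every name in it (CsvBackoffSketch, CapturingReader,
MalformedRowException, parseStrict, parseWithBackoff) is hypothetical, and a toy stdlib-only
parser stands in for CSVParser. The toy parser's rule (reject any row whose field count
differs from the first row) and its char-offset bookkeeping are assumptions made for
illustration, not CSVParser behavior.

```java
import java.io.BufferedReader;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from Tika or commons-csv.
final class CsvBackoffSketch {

    /** Remembers every char pulled from the source, even chars a downstream buffer swallowed. */
    static final class CapturingReader extends FilterReader {
        final StringBuilder captured = new StringBuilder();

        CapturingReader(Reader in) { super(in); }

        @Override public int read() throws IOException {
            int c = super.read();
            if (c != -1) captured.append((char) c);
            return c;
        }

        @Override public int read(char[] buf, int off, int len) throws IOException {
            int n = super.read(buf, off, len);
            if (n > 0) captured.append(buf, off, n);
            return n;
        }
    }

    /** Thrown by the toy parser; rowStart is the char offset where the bad row begins. */
    static final class MalformedRowException extends RuntimeException {
        final int rowStart;
        MalformedRowException(int rowStart) {
            super("malformed row at char offset " + rowStart);
            this.rowStart = rowStart;
        }
    }

    /** Rows parsed before the failure, plus the unparsed tail kept as plain text. */
    static final class Result {
        final List<List<String>> rows;
        final String text;
        Result(List<List<String>> rows, String text) { this.rows = rows; this.text = text; }
    }

    /** Toy stand-in for a real CSV parser: comma-separated, '\n' row ends, no quoting. */
    static void parseStrict(Reader source, List<List<String>> out) throws IOException {
        BufferedReader in = new BufferedReader(source); // reads ahead, like a real parser's buffer
        int offset = 0, rowStart = 0, expected = -1;
        List<String> row = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            offset++;
            if (c == ',') {
                row.add(field.toString());
                field.setLength(0);
            } else if (c == '\n') {
                row.add(field.toString());
                field.setLength(0);
                if (expected < 0) expected = row.size();
                if (row.size() != expected) throw new MalformedRowException(rowStart);
                out.add(row);
                row = new ArrayList<>();
                rowStart = offset;
            } else {
                field.append((char) c);
            }
        }
        if (field.length() > 0 || !row.isEmpty()) { // last row without a trailing newline
            row.add(field.toString());
            if (expected >= 0 && row.size() != expected) throw new MalformedRowException(rowStart);
            out.add(row);
        }
    }

    /** Parse as CSV; on failure keep the good rows and treat the rest as plain text. */
    static Result parseWithBackoff(Reader source) {
        CapturingReader capturing = new CapturingReader(source);
        List<List<String>> rows = new ArrayList<>();
        try {
            try {
                parseStrict(capturing, rows);
                return new Result(rows, "");
            } catch (MalformedRowException e) {
                // Start with what the parser had already buffered from the bad row onward,
                // then append whatever the source still holds.
                StringBuilder text = new StringBuilder(capturing.captured.substring(e.rowStart));
                char[] buf = new char[4096];
                int n;
                while ((n = source.read(buf)) != -1) {
                    text.append(buf, 0, n);
                }
                return new Result(rows, text.toString());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For a small input, the BufferedReader pulls the whole stream on its first fill, so draining
the original Reader after the exception would yield nothing. The captured copy is what lets
the fallback recover the unparsed tail in that case.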

> Add a CSV/TSV detector
> ----------------------
>
>                 Key: TIKA-2833
>                 URL: https://issues.apache.org/jira/browse/TIKA-2833
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: csv_reports.zip
>
>
> Given initial experimentation, I think we can fairly easily add a fairly robust CSV/TSV
> detector that will identify well-formed (ha!) csvs and return the charset encoding and the
> delimiter.
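
One way a heuristic delimiter check of this kind might work is sketched below. It is an
illustration only, not the detector attached to this issue. The class and method names
(DelimiterSniffer, sniff, countOutsideQuotes), the candidate list, and the rule "the same
non-zero count on every sampled line" are all assumptions. Charset detection is left out:
the sample is assumed to be decoded already. Quoted fields containing line breaks are not
handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Tika detector.
final class DelimiterSniffer {

    // Candidate delimiters to try (an assumption, chosen for illustration).
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /** Returns the guessed delimiter, or null when the sample looks like plain text. */
    static Character sniff(String sample) {
        List<String> lines = new ArrayList<>();
        for (String line : sample.split("\r?\n")) {
            if (!line.isEmpty()) lines.add(line);
        }
        if (lines.size() < 2) return null; // too little evidence to call it delimited

        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int first = countOutsideQuotes(lines.get(0), candidate);
            if (first == 0) continue;
            boolean consistent = true;
            for (String line : lines) {
                if (countOutsideQuotes(line, candidate) != first) {
                    consistent = false;
                    break;
                }
            }
            // Prefer the consistent candidate that splits each line into the most fields.
            if (consistent && first > bestCount) {
                best = candidate;
                bestCount = first;
            }
        }
        return best;
    }

    /** Counts occurrences of the delimiter that are not inside double quotes. */
    static int countOutsideQuotes(String line, char delimiter) {
        boolean inQuotes = false;
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes; // an escaped "" toggles twice and cancels out
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }
}
```

A check this simple will still report a delimiter for plain text whose lines happen to
contain the same number of commas, which is the kind of false positive a run against the
full corpus would measure.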



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
