tika-dev mailing list archives

From Gary Gregory <garydgreg...@gmail.com>
Subject Re: [csv] csv format detector/sniffer?
Date Mon, 25 Feb 2019 15:30:57 GMT
Hi,

A Charset detector sounds like something generally useful that belongs in
Commons IO.

Path path = Paths.get(...);
// CharsetDetector is the proposed (hypothetical) class, not an existing Commons IO API
Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
org.apache.commons.csv.CSVParser.parse(path, cs, csvFormat);

Thoughts?

Gary


On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <tallison@apache.org> wrote:

> Commons-CSV team,
>
>   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> relying strictly on the filename for csv detection, and we’re relying
> on our AutodetectReader to identify the charset.  It would be really
> useful for us to be able to detect:
>
> 1) A csv/tsv file vs a regular .txt file by content heuristics
> 2) The parameters: delimiter, escape and quote characters
>
>   We realize that no detection will be perfect, but we have two questions:
>
> 1) Do you have any pointers for this kind of thing?
> 2) If we develop it, would you want to put it in commons-csv or should
> we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> way to integrate this without using a charset detector...but we can
> hold off on that for now.
>
>   Thank you for all of your fantastic work!
>
>            Cheers,
>
>                            Tim
>

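For illustration, the delimiter detection in point 2 of Tim's message could be approached roughly as follows. This is only a minimal sketch, not anything from Commons CSV or Tika: the class name DelimiterSniffer, the candidate list, and the "same non-zero count on every line" heuristic are all assumptions made for this example. It assumes double-quote quoting and does not handle quoted fields that span several lines.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of Commons CSV or Tika. */
final class DelimiterSniffer {

    // Candidate delimiters to try; this list is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /** Counts occurrences of delim that fall outside double-quoted sections. */
    static int countOutsideQuotes(String line, char delim) {
        boolean inQuotes = false;
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                // A doubled quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (c == delim && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the candidate that appears the same non-zero number of times
     * on every non-empty sample line, preferring the highest count.
     * An empty result suggests the sample is plain text rather than csv/tsv.
     */
    static Optional<Character> sniff(List<String> lines) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int n = countOutsideQuotes(line, candidate);
                if (expected < 0) {
                    expected = n;
                } else if (n != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return Optional.ofNullable(best);
    }
}
```

A caller would read the first few lines of the file with an already-detected charset, pass them to sniff, and, if a delimiter comes back, build the CSVFormat from it. Requiring an identical count on every line is deliberately strict, so real data with ragged rows would need a more tolerant score.
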