tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject Re: [csv] csv format detector/sniffer?
Date Mon, 25 Feb 2019 18:38:15 GMT
Hi Gary,

Our charset detection is a combination of HTML meta-header detection,
juniversalchardet, and a small portion of icu4j that we copied in. We
could add that to commons-io, but I don't think you'd want to add
juniversalchardet as a dependency, or would you?  Happy to discuss...

My main question to commons-csv was intended rather to focus on:

1) text vs csv detection (aside from filename glob)
2) detection of most likely: a) delimiter, b) quote character, c)
escape character

 More like:

org.apache.commons.csv.CSVParser.parse(path, charset);

or ideally:

CSVFormat format = CSVDetector.detect(path);

where format includes charset and one value is "probably straight
text, not likely a csv"
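
To make the delimiter part of item 2 concrete, here is a minimal sketch of one possible heuristic: pick the candidate character that appears the same non-zero number of times on every non-empty line. The class name DelimiterSniffer, the candidate list, and the assumption that '"' is the quote character are all hypothetical; none of this is existing commons-csv or Tika code, and it does not attempt quote or escape detection.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of commons-csv or Tika.
final class DelimiterSniffer {
    // Assumed candidate delimiters, tried in this order; ties go to the earlier one.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /** Returns the most likely delimiter, or empty if the lines look like plain text. */
    static Optional<Character> detect(List<String> lines) {
        Character best = null;
        int bestCount = 0;
        for (char c : CANDIDATES) {
            int count = consistentCount(lines, c);
            if (count > bestCount) {
                bestCount = count;
                best = c;
            }
        }
        return Optional.ofNullable(best);
    }

    /**
     * Returns the per-line count of c if it is identical and non-zero on every
     * non-empty line and at least two such lines exist; otherwise returns 0.
     */
    private static int consistentCount(List<String> lines, char c) {
        int expected = -1;
        int seen = 0;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int n = countOutsideQuotes(line, c);
            if (n == 0 || (expected != -1 && n != expected)) {
                return 0;
            }
            expected = n;
            seen++;
        }
        return seen >= 2 ? expected : 0;
    }

    /** Counts occurrences of c that are not inside double-quoted fields. */
    private static int countOutsideQuotes(String line, char c) {
        boolean inQuotes = false;
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            char ch = line.charAt(i);
            if (ch == '"') {
                inQuotes = !inQuotes; // a doubled "" toggles twice, so state is preserved
            } else if (ch == c && !inQuotes) {
                n++;
            }
        }
        return n;
    }
}
```

An empty result would correspond to the "probably straight text" case; a detected character could then be passed to something like CSVFormat.DEFAULT.withDelimiter(c). Quoted fields that span multiple lines would defeat this line-by-line count, so a real detector would need more than this.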

On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <garydgregory@gmail.com> wrote:
>
> Hi,
>
> A Charset detector sounds like something generally useful that belongs in
> Commons IO.
>
> Path path = Paths.get(...);
> Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
> org.apache.commons.csv.CSVParser.parse(path, cs, csvFormat);
>
> Thoughts?
>
> Gary
>
>
> On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <tallison@apache.org> wrote:
>
> > Commons-CSV team,
> >
> >   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> > relying strictly on the filename for csv detection, and we’re relying
> > on our AutodetectReader to identify the charset.  It would be really
> > useful for us to be able to detect:
> >
> > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > 2) The parameters: delimiter, escape and quote characters
> >
> >   We realize that no detection will be perfect, but we have two questions:
> >
> > 1) Do you have any pointers for this kind of thing?
> > 2) If we develop it, would you want to put it in commons-csv or should
> > we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> > way to integrate this without using a charset detector...but we can
> > hold off on that for now.
> >
> >   Thank you for all of your fantastic work!
> >
> >            Cheers,
> >
> >                            Tim
