tika-dev mailing list archives

From sebb <seb...@gmail.com>
Subject Re: [csv] csv format detector/sniffer?
Date Mon, 25 Feb 2019 19:30:25 GMT
On Mon, 25 Feb 2019 at 18:38, Tim Allison <tallison@apache.org> wrote:
>
> Hi Gary,
>
> Our charset detector stuff is a combo of html-metaheader detection,
> juniversalchardet and a cut and paste of a small portion of icu4j...we
> could add that to commons-io, but I don't think you'd want to add
> juniversalchardet as a dependency or would you?  Happy to discuss...

I think the HTML stuff is out of scope for IO; not sure about the other bits.

> My main question to commons-csv was intended rather to focus on:
>
> 1) text vs csv detection (aside from filename glob)
> 2) detection of most likely: a) delimiter, b) quote character, c)
> escape character

That seems reasonable for CSV.

But it should probably be in its own package as it is somewhat outside
the rest of CSV.
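
To make the delimiter part concrete, here is a rough sketch of one possible
heuristic. It is not existing commons-csv or Tika code. The class name
DelimiterSniffer, the candidate set, and the "same field count on every
sampled line" rule are all assumptions for illustration. It does not handle
quoted fields that span lines or escape characters, and it expects the caller
to have already decoded the text and picked a quote character.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of commons-csv or Tika. */
final class DelimiterSniffer {

    // Candidate delimiters to try; this set is an assumption.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    private DelimiterSniffer() {}

    /**
     * Returns the candidate that splits every sampled line into the same
     * number of fields (at least 2), preferring the one yielding the most
     * fields. An empty result means "probably straight text, not likely a csv".
     */
    static Optional<Character> detect(List<String> lines, char quote) {
        char best = 0;
        int bestFields = 1;
        for (char candidate : CANDIDATES) {
            int fields = consistentFieldCount(lines, candidate, quote);
            if (fields > bestFields) {
                bestFields = fields;
                best = candidate;
            }
        }
        return bestFields > 1 ? Optional.of(best) : Optional.empty();
    }

    /**
     * Field count shared by all non-empty lines, or -1 if the lines disagree
     * or fewer than two non-empty lines were sampled.
     */
    private static int consistentFieldCount(List<String> lines, char delimiter, char quote) {
        int expected = -1;
        int sampled = 0;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue;
            }
            int fields = countFields(line, delimiter, quote);
            if (expected == -1) {
                expected = fields;
            } else if (fields != expected) {
                return -1;
            }
            sampled++;
        }
        return sampled >= 2 ? expected : -1;
    }

    /** Counts fields on one line, ignoring delimiters inside quoted sections. */
    private static int countFields(String line, char delimiter, char quote) {
        boolean inQuotes = false;
        int fields = 1;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == quote) {
                inQuotes = !inQuotes; // a doubled quote toggles twice, so state is preserved
            } else if (c == delimiter && !inQuotes) {
                fields++;
            }
        }
        return fields;
    }
}
```

A caller could feed the detected character to
CSVFormat.DEFAULT.withDelimiter(c), and treat an empty result as the
"probably straight text" case. A real detector would likely need to score
candidates rather than demand exact agreement across lines.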


>  More like:
>
> org.apache.commons.csv.CSVParser.parse(path, charset);
>
> or ideally:
>
> CSVFormat format = CSVDetector.detect(path)
>
> where format includes charset and one value is "probably straight
> text, not likely a csv"
>
> On Mon, Feb 25, 2019 at 10:39 AM Gary Gregory <garydgregory@gmail.com> wrote:
> >
> > Hi,
> >
> > A Charset detector sounds like something generally useful that belongs in
> > Commons IO.
> >
> > Path path = Paths.get(...);
> > Charset cs = org.apache.commons.io.CharsetDetector.detect(path);
> > org.apache.commons.csv.CSVParser.parse(path, cs, csvFormat);
> >
> > Thoughts?
> >
> > Gary
> >
> >
> > On Mon, Feb 25, 2019 at 10:23 AM Tim Allison <tallison@apache.org> wrote:
> >
> > > Commons-CSV team,
> > >
> > >   We recently integrated Commons-CSV into Apache Tika.  For now, we’re
> > > relying strictly on the filename for csv detection, and we’re relying
> > > on our AutodetectReader to identify the charset.  It would be really
> > > useful for us to be able to detect:
> > >
> > > 1) A csv/tsv file vs a regular .txt file by content heuristics
> > > 2) The parameters: delimiter, escape and quote characters
> > >
> > >   We realize that no detection will be perfect, but we have two questions:
> > >
> > > 1) Do you have any pointers for this kind of thing?
> > > 2) If we develop it, would you want to put it in commons-csv or should
> > > we leave it in Tika?  I'm not sure, yet, if there'd be a clean/useful
> > > way to integrate this without using a charset detector...but we can
> > > hold off on that for now.
> > >
> > >   Thank you for all of your fantastic work!
> > >
> > >            Cheers,
> > >
> > >                            Tim
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: user-unsubscribe@commons.apache.org
> > > For additional commands, e-mail: user-help@commons.apache.org
> > >
> > >
>
