tika-dev mailing list archives

From Tim Allison <talli...@apache.org>
Subject [csv] csv format detector/sniffer?
Date Mon, 25 Feb 2019 15:23:38 GMT
Commons-CSV team,

  We recently integrated Commons-CSV into Apache Tika.  For now, we
rely strictly on the file name for CSV detection and on our
AutodetectReader to identify the charset.  It would be really
useful for us to be able to detect:

1) Whether a file is CSV/TSV rather than a regular .txt file, by content heuristics
2) The format parameters: delimiter, escape and quote characters
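
  To make the request concrete, here is a minimal sketch of the kind of
content heuristic we have in mind.  It is only an illustration, not
existing Tika or Commons-CSV code: the class name DelimiterSniffer, the
candidate delimiter list, and the fixed double-quote character are all
assumptions.  It picks the candidate delimiter that occurs the same
non-zero number of times on every sampled line, and returns null when
no candidate is consistent (i.e. the sample looks like plain text).

```java
import java.util.List;

final class DelimiterSniffer {
    // Hypothetical candidate set; a real detector would likely make this configurable.
    private static final char[] CANDIDATES = {',', '\t', ';', '|'};

    /**
     * Returns the candidate delimiter that appears the same non-zero number
     * of times on every non-empty line, or null if none qualifies.
     * Requires at least two non-empty lines to call anything delimited.
     */
    static Character sniff(List<String> lines) {
        Character best = null;
        int bestCount = 0;
        for (char candidate : CANDIDATES) {
            int expected = -1;      // delimiter count seen on the first non-empty line
            int rows = 0;           // number of non-empty lines examined
            boolean consistent = true;
            for (String line : lines) {
                if (line.isEmpty()) {
                    continue;
                }
                int count = countOutsideQuotes(line, candidate);
                if (expected < 0) {
                    expected = count;
                } else if (count != expected) {
                    consistent = false;
                    break;
                }
                rows++;
            }
            // Prefer the consistent candidate that yields the most columns.
            if (consistent && rows >= 2 && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    /** Counts delimiter occurrences that are not inside double-quoted fields. */
    static int countOutsideQuotes(String line, char delimiter) {
        boolean inQuotes = false;
        int count = 0;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c == '"') {
                inQuotes = !inQuotes;   // a doubled quote ("") toggles twice, so state is preserved
            } else if (c == delimiter && !inQuotes) {
                count++;
            }
        }
        return count;
    }
}
```

  This sketch works on already-decoded lines, so it still depends on a
charset detector upstream, and it does not handle quoted fields that
span multiple lines or quote/escape characters other than the double
quote.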

  We realize that no detection will be perfect, but we have two questions:

1) Do you have any pointers for this kind of thing?
2) If we develop it, would you want to put it in commons-csv, or should
we leave it in Tika?  I'm not yet sure whether there would be a clean,
useful way to integrate this without using a charset detector, but we
can hold off on that for now.

  Thank you for all of your fantastic work!
