spark-dev mailing list archives

From Maxim Gekk <maxim.g...@databricks.com>
Subject from_csv
Date Sat, 15 Sep 2018 09:57:58 GMT
Hi All,

I would like to propose a new function, from_csv(), for parsing columns
containing strings in CSV format. Here is my PR:
https://github.com/apache/spark/pull/22379

A typical use case is loading a dataset from external storage, a DBMS, or a
system like Kafka, where CSV content was dumped into one of the columns/fields.
Other columns could contain related information such as timestamps, ids,
sources of data, etc. The column with CSV strings can be parsed by the existing
csv() method of DataFrameReader, but in that case we have to "clean up" the
dataset and remove the other columns, since the csv() method requires
Dataset[String]. Joining the parsing result back to the original dataset by
position is expensive and inconvenient. Instead, users parse CSV columns with
string functions. That approach is usually error-prone, especially for quoted
values and other special cases.
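
To illustrate the quoted-value problem outside of Spark, here is a minimal
sketch in plain Python (standard library only). The records, field names, and
the helpers naive_parse() and csv_parse() are hypothetical and are not part of
the PR; the sketch only shows why splitting on the delimiter goes wrong and why
parsing the column in place keeps the related fields attached.

```python
import csv
import io

# Hypothetical records: an id, a timestamp, and a column holding one CSV line.
records = [
    {"id": 1, "ts": "2018-09-15T09:57:58Z", "payload": '42,"Smith, John",NY'},
    {"id": 2, "ts": "2018-09-15T09:58:03Z", "payload": '43,"said ""hi""",CA'},
]


def naive_parse(line):
    # Plain string splitting: breaks on commas inside quoted values
    # and leaves the quote characters in the result.
    return line.split(",")


def csv_parse(line):
    # A real CSV parser honours quoting and doubled (escaped) quotes.
    return next(csv.reader(io.StringIO(line)))


for rec in records:
    parsed = csv_parse(rec["payload"])
    # The parsed fields stay next to id and ts, so no join by position is needed.
    print(rec["id"], rec["ts"], naive_parse(rec["payload"]), parsed)
```

For the first record the naive split yields four fragments instead of three
fields, while the CSV parser returns "Smith, John" as a single value.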

The methods proposed in the PR should provide a better user experience when
parsing CSV-like columns. Please share your thoughts.

-- 

Maxim Gekk

Technical Solutions Lead

Databricks Inc.

maxim.gekk@databricks.com

databricks.com
