spark-dev mailing list archives

From Reynold Xin <r...@databricks.com>
Subject Re: from_csv
Date Sun, 16 Sep 2018 05:00:54 GMT
makes sense - i'd make this as consistent as to_json / from_json as
possible.

how would this work in sql? i.e. how would passing options in work?

--
excuse the brevity and lower case due to wrist injury


On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.gekk@databricks.com>
wrote:

> Hi All,
>
> I would like to propose a new function, from_csv(), for parsing columns
> containing strings in CSV format. Here is my PR:
> https://github.com/apache/spark/pull/22379
>
> A use case is loading a dataset from external storage, a DBMS, or a system
> like Kafka, where CSV content was dumped as one of the columns/fields. Other
> columns could contain related information such as timestamps, ids, and
> sources of data. The column with CSV strings can be parsed by the existing
> csv() method of DataFrameReader, but in that case we have to "clean up" the
> dataset and remove the other columns, since the csv() method requires
> Dataset[String]. Joining the parsing result back to the original dataset by
> position is expensive and inconvenient. Instead, users parse CSV columns
> with string functions. That approach is usually error-prone, especially for
> quoted values and other special cases.
>
> The methods proposed in the PR should give a better user experience when
> parsing CSV-like columns. Please share your thoughts.
>
> --
> Maxim Gekk
> Technical Solutions Lead
> Databricks Inc.
> maxim.gekk@databricks.com
> databricks.com <http://databricks.com/>
>
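
The quoted message notes that parsing CSV columns with plain string functions is error-prone for quoted values. A minimal sketch of that pitfall follows, using only the Python standard library rather than Spark; the sample row and variable names are hypothetical and are not taken from the PR.

```python
import csv
import io

# Hypothetical CSV payload as it might sit in one column of a wider dataset.
# The second field is quoted because it contains the delimiter.
row = '1,"Smith, John",2018-09-15'

# Naive string-function parsing: splits inside the quoted value and
# leaves the quote characters in the output.
naive = row.split(",")

# CSV-aware parsing: the quoted field stays intact and the quotes are removed.
parsed = next(csv.reader(io.StringIO(row)))

print(naive)
print(parsed)
```

The naive split yields one field too many, which is the kind of special case a dedicated CSV-parsing function is meant to handle.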
