spark-dev mailing list archives

From Reynold Xin <>
Subject Re: from_csv
Date Sun, 16 Sep 2018 05:00:54 GMT
makes sense - i'd make this as consistent with to_json / from_json as possible

how would this work in sql? i.e. how would passing options in work?

excuse the brevity and lower case due to wrist injury

On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <> wrote:

> Hi All,
> I would like to propose a new function, from_csv(), for parsing columns
> containing strings in CSV format. Here is my PR:
> A use case is loading a dataset from external storage, a DBMS, or a system
> like Kafka, where CSV content was dumped as one of the columns/fields. Other
> columns could contain related information such as timestamps, ids, data
> sources, etc. The column with CSV strings can be parsed by the existing
> csv() method of DataFrameReader, but in that case we have to "clean up" the
> dataset and drop the other columns, since csv() requires a Dataset[String].
> Joining the parsing result back to the original dataset by position is
> expensive and inconvenient. Instead, users parse CSV columns with string
> functions. That approach is usually error prone, especially for quoted
> values and other special cases.
> The methods proposed in the PR should make for a better user experience in
> parsing CSV-like columns. Please share your thoughts.
> --
> Maxim Gekk
> Technical Solutions Lead
> Databricks Inc.
>   <>
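[Editor's note: the "error prone, especially for quoted values" point above can be illustrated outside Spark. The sketch below uses Python's stdlib csv module as a stand-in for a real CSV parser such as the proposed from_csv(); the function names and sample row are illustrative, not from the thread.]

```python
import csv
import io

def naive_parse(line):
    # The workaround the email describes: parsing a CSV column with
    # plain string functions. Breaks on quoted values containing commas.
    return line.split(",")

def csv_parse(line):
    # A proper CSV parser (stand-in for the proposed from_csv()):
    # quoting is handled correctly.
    return next(csv.reader(io.StringIO(line)))

row = '1,"Hello, world",2018-09-15'
print(naive_parse(row))  # ['1', '"Hello', ' world"', '2018-09-15'] -- split inside the quotes
print(csv_parse(row))    # ['1', 'Hello, world', '2018-09-15']
```

The naive split fractures the quoted field `"Hello, world"` into two pieces and leaves the quote characters in place, which is exactly the class of bug a dedicated from_csv() function would avoid.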
