Hi All,

I would like to propose a new function, from_csv(), for parsing columns that contain strings in CSV format. Here is my PR: https://github.com/apache/spark/pull/22379

One use case is loading a dataset from external storage, a DBMS, or a system like Kafka, where CSV content was dumped into one of the columns/fields. Other columns may contain related information such as timestamps, ids, data sources, etc. The column with CSV strings can be parsed by the existing csv() method of DataFrameReader, but in that case we have to "clean up" the dataset and remove the other columns, since csv() requires a Dataset[String]. Joining the parsed result back to the original dataset by position is expensive and inconvenient. Instead, users parse CSV columns with string functions. That approach is usually error-prone, especially for quoted values and other special cases.
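
To illustrate the quoted-value problem, here is a minimal sketch in plain Python (no Spark involved). The sample row and its field values are made up for illustration; they are not taken from the PR. It compares splitting on the delimiter with a string function against a CSV-aware parser from the standard library:

```python
import csv
import io

# Hypothetical CSV payload as it might sit in a single column.
# The second field is quoted because it contains the delimiter.
row = '1,"Smith, John",2018-09-10'

# Naive approach: split on the delimiter with a plain string function.
# The quoted field is cut in two and the quote characters are kept.
naive = row.split(",")

# CSV-aware approach: the parser honours quoting, so the embedded
# comma stays inside the second field.
parsed = next(csv.reader(io.StringIO(row)))

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split yields four pieces instead of three fields, which is the kind of silent corruption a dedicated CSV parsing function is meant to avoid.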

The methods proposed in the PR should improve the user experience of parsing CSV-like columns. Please share your thoughts.


Maxim Gekk

Technical Solutions Lead

Databricks Inc.