spark-dev mailing list archives

From Dongjin Lee <dong...@apache.org>
Subject Re: from_csv
Date Wed, 19 Sep 2018 14:20:52 GMT
Another +1.

I have already run into this case several times.

On Mon, Sep 17, 2018 at 11:03 AM Hyukjin Kwon <gurwls223@gmail.com> wrote:

> +1 for this idea since text parsing in CSV/JSON is quite common.
>
> One thing to consider is schema inference, as with the JSON functionality. In
> the case of JSON, we added schema_of_json for it, and the same approach should
> apply to CSV too.
> If we see more need for it, we can consider a function like
> schema_of_csv as well.
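For reference, a minimal Scala sketch of the schema-inference path mentioned above, using the existing schema_of_json; the schema_of_csv analogue is hypothetical at this point, and the sample records are made up for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, schema_of_json}

val spark = SparkSession.builder().master("local[*]").appName("schema-inference-sketch").getOrCreate()

// schema_of_json infers a schema string from a sample record,
// which can then be passed back into from_json:
spark.range(1)
  .select(schema_of_json(lit("""{"id":1,"ts":"2018-09-16"}""")))
  .show(truncate = false)

// A schema_of_csv analogue could do the same from a sample CSV line, e.g.
// schema_of_csv(lit("1,2018-09-16")) -- hypothetical, not in Spark at the time of this thread.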
>
>
> On Sun, Sep 16, 2018 at 4:41 PM Maxim Gekk <maxim.gekk@databricks.com> wrote:
>
>> Hi Reynold,
>>
>> > i'd make this as consistent as to_json / from_json as possible
>>
>> Sure, the new function from_csv() has the same signature as from_json().
>>
>> > how would this work in sql? i.e. how would passing options in work?
>>
>> The options are passed to the function via a map, for example:
>> select from_csv('26/08/2015', 'time Timestamp', map('timestampFormat',
>> 'dd/MM/yyyy'))
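For comparison, a minimal Scala sketch of the same call through the DataFrame API: from_json already accepts an options map today, and the proposed from_csv is assumed here to mirror that signature (the DataFrame and column names are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StructType, TimestampType}

val spark = SparkSession.builder().master("local[*]").appName("options-map-sketch").getOrCreate()
import spark.implicits._

val jsonSchema = new StructType().add("time", TimestampType)

// from_json already accepts an options map in the Scala API:
val df = Seq("""{"time":"26/08/2015"}""").toDF("json_col")
df.select(from_json(col("json_col"), jsonSchema, Map("timestampFormat" -> "dd/MM/yyyy"))).show()

// The proposed from_csv is expected to take the same shape:
// df.select(from_csv(col("csv_col"), jsonSchema, Map("timestampFormat" -> "dd/MM/yyyy")))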
>>
>> On Sun, Sep 16, 2018 at 7:01 AM Reynold Xin <rxin@databricks.com> wrote:
>>
>>> makes sense - i'd make this as consistent as to_json / from_json as
>>> possible.
>>>
>>> how would this work in sql? i.e. how would passing options in work?
>>>
>>> --
>>> excuse the brevity and lower case due to wrist injury
>>>
>>>
>>> On Sat, Sep 15, 2018 at 2:58 AM Maxim Gekk <maxim.gekk@databricks.com>
>>> wrote:
>>>
>>>> Hi All,
>>>>
>>>> I would like to propose a new function, from_csv(), for parsing columns
>>>> containing strings in CSV format. Here is my PR:
>>>> https://github.com/apache/spark/pull/22379
>>>>
>>>> A use case is loading a dataset from external storage, a DBMS, or a
>>>> system like Kafka, where CSV content was dumped into one of the
>>>> columns/fields. Other columns may contain related information such as
>>>> timestamps, ids, data sources, etc. The column with CSV strings can be
>>>> parsed by the existing csv() method of DataFrameReader, but in that case
>>>> we have to "clean up" the dataset and remove the other columns, since the
>>>> csv() method requires a Dataset[String]. Joining the parsed result back to
>>>> the original dataset by position is expensive and inconvenient. Instead,
>>>> users parse CSV columns with string functions. That approach is usually
>>>> error prone, especially for quoted values and other special cases.
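To make the pain point concrete, here is a minimal Scala sketch of the existing workaround; the table, column names, and schema are made up for illustration:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-column-workaround").getOrCreate()
import spark.implicits._

// One column carries a CSV payload, the others carry related metadata:
val events = Seq(
  (1L, "kafka", "26/08/2015,100"),
  (2L, "kafka", "27/08/2015,200")
).toDF("id", "source", "payload")

// DataFrameReader.csv only accepts Dataset[String], so id and source are dropped
// here and would have to be joined back by position afterwards:
val parsed = spark.read
  .schema("time TIMESTAMP, value INT")
  .option("timestampFormat", "dd/MM/yyyy")
  .csv(events.select($"payload").as[String])

parsed.show()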
>>>>
>>>> The methods proposed in the PR should provide a better user experience
>>>> for parsing CSV-like columns. Please share your thoughts.
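And the counterpart with the proposed function, continuing the events DataFrame from the sketch above; this is hypothetical until the PR is merged and assumes a from_json-like signature:

import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types.{IntegerType, StructType, TimestampType}

val payloadSchema = new StructType()
  .add("time", TimestampType)
  .add("value", IntegerType)

// The CSV payload is parsed in place; id and source stay attached, no join needed:
events
  .withColumn("parsed", from_csv(col("payload"), payloadSchema, Map("timestampFormat" -> "dd/MM/yyyy")))
  .show(truncate = false)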
>>>>
>>>> --
>>>>
>>>> Maxim Gekk
>>>>
>>>> Technical Solutions Lead
>>>>
>>>> Databricks Inc.
>>>>
>>>> maxim.gekk@databricks.com
>>>>
>>>> databricks.com
>>>>
>>>>
>>>
>>

-- 
*Dongjin Lee*

*A hitchhiker in the mathematical world.*

*github: github.com/dongjinleekr <http://github.com/dongjinleekr>
linkedin: kr.linkedin.com/in/dongjinleekr <http://kr.linkedin.com/in/dongjinleekr>
slideshare: www.slideshare.net/dongjinleekr <http://www.slideshare.net/dongjinleekr>*
