spark-user mailing list archives

From java bigdata <hadoopst...@gmail.com>
Subject Re: Dataframe Transformation with Inner fields in Complex Datatypes.
Date Wed, 20 Jul 2016 03:00:02 GMT
Hi Ayan,
Thanks for your update.

All I am trying to do is update an inner field in one of the
dataframe's complex-type columns. The withColumn method adds a new column or
replaces an existing one. In my case the column is a nested field. Please see
the example I gave in the mail below.
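
A minimal sketch of the behaviour in question, assuming Spark 2.x or later (for SparkSession) and a local session. The case classes and the object name are hypothetical and only mirror the schema from the original mail; the built-in upper function stands in for the toUpperCaseUDF mentioned there.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, upper}

// Hypothetical case classes mirroring the schema quoted in this thread.
case class Line1(buildname: String, stname: String)
case class Address(line1: Line1, line2: String)
case class Customer(id: Int, address: Address)

object WithColumnOnNestedName {
  // withColumn treats the dotted string as the name of a top-level column.
  // No top-level column has that name, so a new one is appended and the
  // nested field inside `address` is left untouched.
  def appendUpper(df: DataFrame): DataFrame =
    df.withColumn("address.line1.buildname", upper(col("address.line1.buildname")))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-demo").getOrCreate()
    import spark.implicits._

    val df = Seq(Customer(1, Address(Line1("tower a", "main st"), "suite 5"))).toDF()
    // Schema shows id, address (unchanged) and one extra top-level column.
    appendUpper(df).printSchema()
    spark.stop()
  }
}
```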

I don't need to add a new column. One way I can think of to solve this is to
create a new complex-type (StructType) column, identical to the one in the
dataframe, updating the nested field in the process. At the end, add the
newly created struct column to the dataframe and drop the old one.
Disadvantages:
1. This will require iterating through millions of rows, with a
performance impact.
2. If only one or a few columns need to be updated, creating a new column
and adding it to the dataframe may not be the right approach.
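
The rebuild-the-struct approach described above can be sketched as follows. This is a sketch under assumptions, not a definitive solution: it assumes Spark 2.x or later, uses hypothetical case classes and names that mirror the schema from the original mail, and uses the built-in upper function in place of a UDF. Passing the existing name "address" to withColumn replaces that column in place, so no separate drop is needed, and the rebuild is an ordinary column expression evaluated as part of the query plan rather than a hand-written loop over rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, struct, upper}

// Hypothetical case classes mirroring the schema quoted in this thread.
case class Line1(buildname: String, stname: String)
case class Address(line1: Line1, line2: String)
case class Customer(id: Int, address: Address)

object NestedFieldUpdate {
  // Rebuild `address` field by field, changing only line1.buildname.
  // Every other field is copied through under its original name.
  def upperBuildName(df: DataFrame): DataFrame =
    df.withColumn(
      "address", // existing top-level name, so the column is replaced
      struct(
        struct(
          upper(col("address.line1.buildname")).as("buildname"),
          col("address.line1.stname").as("stname")
        ).as("line1"),
        col("address.line2").as("line2")
      )
    )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nested-update").getOrCreate()
    import spark.implicits._

    val df = Seq(Customer(1, Address(Line1("tower a", "main st"), "suite 5"))).toDF()
    val updated = upperBuildName(df)
    updated.printSchema()
    updated.show(false)
    spark.stop()
  }
}
```

Every field of the struct has to be listed by hand here, which becomes tedious for wide structs. On Spark 3.1 and later, Column.withField offers a shorter form, e.g. `col("address").withField("line1.buildname", upper(col("address.line1.buildname")))`; that method did not exist when this thread was written.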

Any help will be greatly appreciated!
Thanks.

On Monday, July 18, 2016, ayan guha <guha.ayan@gmail.com> wrote:

> Hi
>
> withColumn adds the column. If you want a different name, please use the
> .alias() function.
>
> On Mon, Jul 18, 2016 at 2:16 AM, java bigdata <hadoopstack@gmail.com> wrote:
>
>> Hi Team,
>>
>> I am facing a major issue while transforming a dataframe containing
>> complex datatype columns. I need to update the inner fields of a complex
>> datatype, e.g. convert one inner field to UPPERCASE letters, and return
>> the same dataframe with the new uppercase values in it. Below is my issue
>> description. Kindly suggest/guide me a way forward.
>>
>> *My suggestion:* Can we have a new version of *dataframe.withColumn(<innerfieldreference>,
>> udf($innerfieldreference), <reference or colname indicator argument>)*,
>> so that when this method is executed, I get the same dataframe with
>> transformed values?
>>
>>
>> *Issue Description:*
>> Using dataframe.withColumn(<colname>,udf($colname)) for inner fields in a
>> struct/complex datatype results in a new dataframe with a new column
>> appended to it. "colname" in the above argument is given as the full name,
>> with dot notation to access the struct/complex fields.
>>
>> For example, a Hive table has columns: (id int, address struct<line1: struct<
>> buildname:string, stname:string>, line2:string>)
>>
>> I need to update the inner field 'buildname'. I can select the inner
>> field through the dataframe as df.select($"address.line1.buildname"). However,
>> when I use df.withColumn("address.line1.buildname",
>> toUpperCaseUDF($"address.line1.buildname")), it results in a new
>> dataframe with a new column "address.line1.buildname" appended, holding the
>> toUpperCaseUDF values computed from the inner field buildname.
>>
>> How can I update the inner fields of complex data types? Kindly
>> suggest.
>>
>> Thanks in anticipation.
>>
>> Best Regards,
>> Naveen Kumar.
>>
>
>
>
> --
> Best Regards,
> Ayan Guha
>
