spark-user mailing list archives

From 颜发才(Yan Facai) <yaf...@gmail.com>
Subject Re: Best practice for preprocessing feature with DataFrame
Date Thu, 17 Nov 2016 09:37:01 GMT
Could you give me an example of how to use the Column functions?
Thanks very much.
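(For reference, a minimal sketch of what the Column-function approach might look like, assuming a DataFrame named `input` with the schema shown below and `import sqlContext.implicits._` in scope; `regexp_extract` and `when`/`otherwise` come from `org.apache.spark.sql.functions`:)

```scala
import org.apache.spark.sql.functions._

// Sketch only: `input` is assumed to be the DataFrame shown below.
// Strip the trailing "s" from age ("90s" -> 90) and map gender codes.
val result = input.select(
  regexp_extract($"age", "(\\d+)", 1).cast("int").as("age"),
  when($"gender" === "1", "male")
    .when($"gender" === "2", "female")
    .as("gender"),
  $"city_id"
)
```

Rows that match neither gender code end up as null, which may or may not be what you want.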

On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot <divya.htconex@gmail.com>
wrote:

> Hi,
>
> You can use the Column functions provided by Spark API
>
> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
>
> Hope this helps.
>
> Thanks,
> Divya
>
>
> On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
>
>> Hi,
>> I have a sample, like:
>> +---+------+--------------------+
>> |age|gender|             city_id|
>> +---+------+--------------------+
>> |   |     1|1042015:city_2044...|
>> |90s|     2|1042015:city_2035...|
>> |80s|     2|1042015:city_2061...|
>> +---+------+--------------------+
>>
>> and expectation is:
>> "age":  90s -> 90, 80s -> 80
>> "gender": 1 -> "male", 2 -> "female"
>>
>> I have two solutions:
>> 1. Handle each column separately, and then join them all by index.
>>     val age = input.select("age").map(...)
>>     val gender = input.select("gender").map(...)
>>     val result = ...
>>
>> 2. Write a UDF for each column, and then use them together:
>>      val result = input.select(ageUDF($"age"), genderUDF($"gender"))
>>
>> However, both feel awkward.
>>
>> Does anyone have a better workflow?
>> Write some custom Transforms and use pipeline?
>>
>> Thanks.
>>
>>
>>
>>
>
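The custom-Transformer idea raised above could look roughly like this; a sketch against the Spark 1.6 ML API, with the class name and column handling chosen purely for illustration:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: wrap one column's mapping in a Transformer
// so it can be composed with other stages in a Pipeline.
class GenderMapper(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("genderMapper"))

  override def transform(df: DataFrame): DataFrame =
    df.withColumn("gender",
      when(col("gender") === "1", "male")
        .when(col("gender") === "2", "female"))

  // Schema is unchanged apart from the gender column's values.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```

One such Transformer per column (or one that takes a column name as a Param) keeps the per-column logic isolated and reusable across pipelines.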
