spark-user mailing list archives

From 颜发才(Yan Facai) <yaf...@gmail.com>
Subject Best practice for preprocessing feature with DataFrame
Date Thu, 17 Nov 2016 04:08:26 GMT
Hi,
I have a sample like this:
+---+------+--------------------+
|age|gender|             city_id|
+---+------+--------------------+
|   |     1|1042015:city_2044...|
|90s|     2|1042015:city_2035...|
|80s|     2|1042015:city_2061...|
+---+------+--------------------+

and the expected result is:
"age":  90s -> 90, 80s -> 80
"gender": 1 -> "male", 2 -> "female"

I have two solutions:
1. Handle each column separately, and then join all by index.
    val age = input.select("age").map(...)
    val gender = input.select("gender").map(...)
    val result = ...

2. Write a UDF for each column, and then use them together:
     val result = input.select(ageUDF($"age"), genderUDF($"gender"))
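For concreteness, option 2 can also be written with `withColumn` so each column is replaced in place. This is only a sketch assuming the column names from the sample above; the UDF bodies are illustrative guesses at the intended mappings:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object PreprocessExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("preprocess")
      .getOrCreate()
    import spark.implicits._

    val input = Seq(("", "1"), ("90s", "2"), ("80s", "2")).toDF("age", "gender")

    // Strip the trailing "s" and parse the decade; empty strings become null.
    val ageUDF = udf { s: String =>
      if (s == null || s.isEmpty) None else Some(s.stripSuffix("s").toInt)
    }
    // Map the numeric gender code onto a label.
    val genderUDF = udf { s: String =>
      s match {
        case "1" => "male"
        case "2" => "female"
        case _   => null
      }
    }

    val result = input
      .withColumn("age", ageUDF($"age"))
      .withColumn("gender", genderUDF($"gender"))
    result.show()
    spark.stop()
  }
}
```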

However, both approaches feel awkward.

Does anyone have a better workflow?
Write some custom Transformers and use a Pipeline?
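In case it helps the discussion, a custom Transformer along the lines suggested above might look like the sketch below. The class name `GenderMapper` and the mapping logic are hypothetical, and the minimal `Transformer` subclass skips the usual `Params` for the input/output column names:

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.StructType

// Hypothetical Transformer that maps gender codes to labels.
// One such Transformer per column can then be chained in a Pipeline.
class GenderMapper(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("genderMapper"))

  private val mapUDF = udf { s: String =>
    s match {
      case "1" => "male"
      case "2" => "female"
      case _   => null
    }
  }

  // Replace the "gender" column with its mapped labels.
  override def transform(ds: Dataset[_]): DataFrame =
    ds.withColumn("gender", mapUDF(ds("gender")))

  // The column keeps its name and string type, so the schema is unchanged.
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): GenderMapper = defaultCopy(extra)
}
```

With an analogous `AgeMapper`, the two stages could then be composed as `new Pipeline().setStages(Array(new AgeMapper(), new GenderMapper()))`, keeping each column's preprocessing in its own unit while still producing a single transformed DataFrame.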

Thanks.
