spark-user mailing list archives

From 颜发才(Yan Facai) <yaf...@gmail.com>
Subject Re: Best practice for preprocessing feature with DataFrame
Date Wed, 23 Nov 2016 04:09:23 GMT
Thanks, White.

On Thu, Nov 17, 2016 at 11:15 PM, Stuart White <stuart.white1@gmail.com>
wrote:

> Sorry.  Small typo.  That last part should be:
>
> val modifiedRows = rows
>   .select(
>     substring('age, 0, 2) as "age",
>     when('gender === 1, "male").otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
>   )
> modifiedRows.show
>
> +---+-------+
> |age| gender|
> +---+-------+
> | 90|   male|
> | 80| female|
> | 80|unknown|
> +---+-------+
>
> On Thu, Nov 17, 2016 at 8:57 AM, Stuart White <stuart.white1@gmail.com>
> wrote:
> > import org.apache.spark.sql.functions._
> >
> > val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
> > rows.show
> >
> > +---+------+
> > |age|gender|
> > +---+------+
> > |90s|     1|
> > |80s|     2|
> > |80s|     3|
> > +---+------+
> >
> > val modifiedRows
> >   .select(
> >     substring('age, 0, 2) as "age",
> >     when('gender === 1, "male").otherwise(when('gender === 2, "female").otherwise("unknown")) as "gender"
> >   )
> > modifiedRows.show
> >
> > +---+-------+
> > |age| gender|
> > +---+-------+
> > | 90|   male|
> > | 80| female|
> > | 80|unknown|
> > +---+-------+
> >
> > On Thu, Nov 17, 2016 at 3:37 AM, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
> >> Could you give me an example of how to use the Column functions?
> >> Thanks very much.
> >>
> >> On Thu, Nov 17, 2016 at 12:23 PM, Divya Gehlot <divya.htconex@gmail.com> wrote:
> >>>
> >>> Hi,
> >>>
> >>> You can use the Column functions provided by the Spark API:
> >>>
> >>> https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
> >>>
> >>> Hope this helps.
> >>>
> >>> Thanks,
> >>> Divya
> >>>
> >>>
> >>> On 17 November 2016 at 12:08, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
> >>>>
> >>>> Hi,
> >>>> I have a sample like this:
> >>>> +---+------+--------------------+
> >>>> |age|gender|             city_id|
> >>>> +---+------+--------------------+
> >>>> |   |     1|1042015:city_2044...|
> >>>> |90s|     2|1042015:city_2035...|
> >>>> |80s|     2|1042015:city_2061...|
> >>>> +---+------+--------------------+
> >>>>
> >>>> and the expected result is:
> >>>> "age":  90s -> 90, 80s -> 80
> >>>> "gender": 1 -> "male", 2 -> "female"
> >>>>
> >>>> I have two solutions:
> >>>> 1. Handle each column separately,  and then join all by index.
> >>>>     val age = input.select("age").map(...)
> >>>>     val gender = input.select("gender").map(...)
> >>>>     val result = ...
> >>>>
> >>>> 2. Write a UDF for each column, and then use them together:
> >>>>      val result = input.select(ageUDF($"age"), genderUDF($"gender"))
> >>>>
> >>>> However, both are awkward.
> >>>>
> >>>> Does anyone have a better workflow?
> >>>> Should I write some custom Transformers and use a Pipeline?
> >>>>
> >>>> Thanks.
> >>>>
> >>>>
> >>>>
> >>>
> >>
>
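
For reference, the Column-function approach from the thread can be sketched as a single reusable function. This is only a sketch under stated assumptions: it assumes Spark 2.x or later (for `SparkSession`), the names `PreprocessSketch` and `preprocess` are hypothetical, the `city_id` column is left out, and `regexp_extract` is swapped in for `substring` so that the blank age in the original sample comes out as an empty string. The two `when` calls are chained rather than nested, which should behave the same way.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_extract, when}

// Hypothetical object and method names, used only for illustration.
object PreprocessSketch {

  // Rewrites both columns in one select, with no UDFs and no joins.
  def preprocess(input: DataFrame): DataFrame =
    input.select(
      // Keep the leading digits: "90s" -> "90". A blank age has no match
      // and becomes an empty string.
      regexp_extract(col("age"), "^(\\d+)", 1).as("age"),
      // Chained when() calls; any other code falls through to "unknown".
      when(col("gender") === 1, "male")
        .when(col("gender") === 2, "female")
        .otherwise("unknown")
        .as("gender")
    )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("preprocess-sketch")
      .getOrCreate()
    import spark.implicits._

    // Same shape as the sample in the original question, minus city_id.
    val input = Seq(("", 1), ("90s", 2), ("80s", 2)).toDF("age", "gender")
    preprocess(input).show()

    spark.stop()
  }
}
```

If the same logic has to run as a stage in an ML Pipeline, `org.apache.spark.ml.feature.SQLTransformer` may be an option for wrapping an equivalent SQL statement, without writing a custom Transformer.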
