I'm sure there's another way to do it; I hope someone can show us. I couldn't figure out how to use `map` either.

On Wed, Sep 21, 2016 at 3:32 AM, 颜发才 (Yan Facai) wrote:
Thanks, Peter.
It works!

Why is the udf needed?

On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <pete.figliozzi@gmail.com> wrote:
Hi Yan, I agree, it IS really confusing. Here is the technique for transforming a column. It is very general because you can make "myConvert" do whatever you want.

import org.apache.spark.mllib.linalg.Vectors
// Provides udf and col
import org.apache.spark.sql.functions.{col, udf}

val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF

df.show()
// The columns were named "_1" and "_2"
// Very confusing, because it looks like a Scala wildcard when we refer to it in code

val myConvert = (x: String) => { Vectors.parse(x) }
val myConvertUDF = udf(myConvert)

val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))

newDf.show()
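
As a follow-up sketch for the original GBTClassifier question: the same UDF technique can be pointed at the `org.apache.spark.ml.linalg` vector type, which is what the `spark.ml` estimators work with. This assumes Spark 2.x with spark-sql and spark-mllib on the classpath. The local session setup, the helper name `toMlVector`, and the column names "id", "features" and "features_vec" are all made up for illustration, and the helper assumes every value is a non-empty bracketed list of numbers.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("parse-features").getOrCreate()
import spark.implicits._

val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF("id", "features")

// Hypothetical helper: strip the brackets, split on commas, build a dense ml vector.
val toMlVector = udf { (s: String) =>
  Vectors.dense(s.trim.stripPrefix("[").stripSuffix("]").split(",").map(_.trim.toDouble))
}

// Add the parsed vectors as a new column next to the original strings.
val parsed = df.withColumn("features_vec", toMlVector(col("features")))
parsed.printSchema()
```

Naming the columns in `toDF` also avoids the "_1" / "_2" confusion mentioned above.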

On Mon, Sep 19, 2016 at 3:29 AM, 颜发才 (Yan Facai) wrote:
Hi, all.
I find this really confusing.

I can use Vectors.parse to create a DataFrame that contains a Vector column.

    scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1, Vectors.parse("[2,4,6]"))).toDF
    dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]

But using map to convert String to Vector throws an error:

    scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
    dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

    scala> dataStr.map(row => Vectors.parse(row.getString(1)))
    <console>:30: error: Unable to find encoder for type stored in a Dataset.  Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._  Support for serializing other types will be added in future releases.
      dataStr.map(row => Vectors.parse(row.getString(1)))

Can anyone help me?
Thanks very much!


On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi wrote:
Hi Yan, I think you'll have to map the features column to a new numerical features column.

Here's one way to do the individual transform:

scala> val x = "[1, 2, 3, 4, 5]"
x: String = [1, 2, 3, 4, 5]

scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
y: Array[Int] = Array(1, 2, 3, 4, 5)

If you don't know about the Scala command line, just type "scala" in a terminal window. It's a good place to try things out.

You can make a function out of this transformation and apply it to your features column to make a new column. Then add this with Dataset.withColumn.
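
A minimal sketch of such a function, in plain Scala with no Spark dependency. The name `parseFeatures` is made up here, and it assumes every value is a bracketed, comma-separated list of numbers.

```scala
// Hypothetical helper: turn a string such as "[1, 2, 3]" into an array of doubles.
def parseFeatures(s: String): Array[Double] = {
  // Drop surrounding whitespace and the enclosing brackets.
  val inner = s.trim.stripPrefix("[").stripSuffix("]").trim
  if (inner.isEmpty) Array.empty[Double]
  else inner.split(",").map(_.trim.toDouble)
}

val y = parseFeatures("[1, 2, 3, 4, 5]")
println(y.mkString(", "))  // prints 1.0, 2.0, 3.0, 4.0, 5.0
```

It returns doubles rather than ints because Spark's vector types store doubles, and it splits on the comma and trims each piece so uneven spacing does not matter.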

See here on how to apply a function to a Column to make a new column.

On Tue, Sep 6, 2016 at 1:56 AM, 颜发才 (Yan Facai) wrote:
Hi,
I have a csv file like:
uid    mid     features          label
123    5231    [0, 1, 3, ...]    True

Both the "features" and "label" columns are used for GBTClassifier.

However, when I read the file, the type of samples.select("features") is String.

My question is:
How can I map samples.select("features") to Vector or any appropriate type,
so I can use it to train like:
        GBTClassifier gbdt = new GBTClassifier()
                .setLabelCol("label")
                .setFeaturesCol("features")
                .setMaxIter(2)
                .setMaxDepth(7);

Thanks.
