spark-user mailing list archives

From 颜发才(Yan Facai) <yaf...@gmail.com>
Subject Re: Dataframe, Java: How to convert String to Vector ?
Date Wed, 21 Sep 2016 08:32:56 GMT
Thanks, Peter.
It works!

Why is a udf needed?
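
For the Java side of the original question, the per-value conversion that such a udf wraps can be sketched with the standard library alone. This is only a minimal sketch: the class name `FeatureParser` and the method `parseFeatures` are hypothetical names chosen for illustration, the input is assumed to be a bracketed, comma-separated list such as "[1,3,5]" or "[1, 3, 5]", and no Spark classes are involved.

```java
import java.util.Arrays;

public class FeatureParser {

    // Hypothetical helper (not part of Spark): turns a bracketed,
    // comma-separated string such as "[1, 3, 5]" into a double array.
    public static double[] parseFeatures(String text) {
        String trimmed = text.trim();
        boolean bracketed = trimmed.length() >= 2
                && trimmed.charAt(0) == '['
                && trimmed.charAt(trimmed.length() - 1) == ']';
        if (!bracketed) {
            throw new IllegalArgumentException("expected a bracketed list, got: " + text);
        }

        // Drop the surrounding brackets; an empty body means an empty array.
        String body = trimmed.substring(1, trimmed.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }

        // Split on commas and parse each element, ignoring surrounding spaces.
        String[] parts = body.split(",");
        double[] values = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            values[i] = Double.parseDouble(parts[i].trim());
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(parseFeatures("[1,3,5]")));
        System.out.println(Arrays.toString(parseFeatures("[2, 4, 6]")));
    }
}
```

The resulting array could then be handed to `Vectors.dense(double[])` inside a udf registered on the Spark session; that wiring is left out here because it depends on the Spark version in use.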




On Wed, Sep 21, 2016 at 12:00 AM, Peter Figliozzi <pete.figliozzi@gmail.com>
wrote:

> Hi Yan, I agree, it IS really confusing.  Here is the technique for
> transforming a column.  It is very general because you can make "myConvert"
> do whatever you want.
>
> import org.apache.spark.mllib.linalg.Vectors
> import org.apache.spark.sql.functions.{col, udf}
> val df = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>
> df.show()
> // The columns are named "_1" and "_2".
> // This is confusing, because "_" looks like a Scala wildcard when we
> // refer to these names in code.
>
> val myConvert = (x: String) => { Vectors.parse(x) }
> val myConvertUDF = udf(myConvert)
>
> val newDf = df.withColumn("parsed", myConvertUDF(col("_2")))
>
> newDf.show()
>
> On Mon, Sep 19, 2016 at 3:29 AM, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
>
>> Hi, all.
>> I find this really confusing.
>>
>> I can use Vectors.parse to create a DataFrame that contains a Vector column.
>>
>>     scala> val dataVec = Seq((0, Vectors.parse("[1,3,5]")), (1, Vectors.parse("[2,4,6]"))).toDF
>>     dataVec: org.apache.spark.sql.DataFrame = [_1: int, _2: vector]
>>
>>
>> But using map to convert String to Vector throws an error:
>>
>>     scala> val dataStr = Seq((0, "[1,3,5]"), (1, "[2,4,6]")).toDF
>>     dataStr: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>>     scala> dataStr.map(row => Vectors.parse(row.getString(1)))
>>     <console>:30: error: Unable to find encoder for type stored in a
>> Dataset.  Primitive types (Int, String, etc) and Product types (case
>> classes) are supported by importing spark.implicits._  Support for
>> serializing other types will be added in future releases.
>>       dataStr.map(row => Vectors.parse(row.getString(1)))
>>
>>
>> Can anyone help me?
>> Thanks very much!
>>
>>
>>
>>
>>
>>
>>
>> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figliozzi@gmail.com> wrote:
>>
>>> Hi Yan, I think you'll have to map the features column to a new
>>> numerical features column.
>>>
>>> Here's one way to do the individual transform:
>>>
>>> scala> val x = "[1, 2, 3, 4, 5]"
>>> x: String = [1, 2, 3, 4, 5]
>>>
>>> scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
>>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>>
>>> If you don't know about the Scala command line, just type "scala" in a
>>> terminal window.  It's a good place to try things out.
>>>
>>> You can make a function out of this transformation and apply it to your
>>> features column to make a new column.  Then add this with
>>> Dataset.withColumn.
>>>
>>> See here
>>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>>> on how to apply a function to a Column to make a new column.
>>>
>>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
>>>
>>>> Hi,
>>>> I have a csv file like:
>>>> uid      mid      features       label
>>>> 123    5231    [0, 1, 3, ...]    True
>>>>
>>>> Both the "features" and "label" columns are used for GBTClassifier.
>>>>
>>>> However, when I read the file:
>>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>>> The type of samples.select("features") is String.
>>>>
>>>> My question is:
>>>> How can I map samples.select("features") to Vector or another appropriate
>>>> type, so that I can use it for training, like this:
>>>>         GBTClassifier gbdt = new GBTClassifier()
>>>>                 .setLabelCol("label")
>>>>                 .setFeaturesCol("features")
>>>>                 .setMaxIter(2)
>>>>                 .setMaxDepth(7);
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>
