spark-user mailing list archives

From Peter Figliozzi <pete.figlio...@gmail.com>
Subject Re: Dataframe, Java: How to convert String to Vector ?
Date Wed, 07 Sep 2016 14:14:24 GMT
Here's a decent GitBook: Mastering Apache Spark
<https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details>.

I'm new to Scala too.  I found it very helpful to study the Scala language
on its own, without Spark.  The documentation at
<http://docs.scala-lang.org/index.html> is excellent.

Pete

On Wed, Sep 7, 2016 at 1:39 AM, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:

> Hi Peter,
> I'm familiar with Pandas / NumPy in Python, while Spark / Scala is
> totally new to me.
> Pandas provides detailed documentation, such as how to slice data, parse
> files, and use the apply and filter functions.
>
> Does Spark have more detailed documentation?
>
>
>
> On Tue, Sep 6, 2016 at 9:58 PM, Peter Figliozzi <pete.figliozzi@gmail.com>
> wrote:
>
>> Hi Yan, I think you'll have to map the features column to a new numerical
>> features column.
>>
>> Here's one way to do the individual transform:
>>
>> scala> val x = "[1, 2, 3, 4, 5]"
>> x: String = [1, 2, 3, 4, 5]
>>
>> scala> val y: Array[Int] = x.slice(1, x.length - 1).replace(",", "").split(" ").map(_.toInt)
>> y: Array[Int] = Array(1, 2, 3, 4, 5)
>>
>> If you don't know about the Scala command line, just type "scala" in a
>> terminal window.  It's a good place to try things out.
>>
>> You can make a function out of this transformation and apply it to your
>> features column to make a new column.  Then add this with
>> Dataset.withColumn.
>>
>> See
>> <http://stackoverflow.com/questions/35227568/applying-function-to-spark-dataframe-column>
>> for how to apply a function to a Column to make a new column.
>>
>> On Tue, Sep 6, 2016 at 1:56 AM, 颜发才(Yan Facai) <yafc18@gmail.com> wrote:
>>
>>> Hi,
>>> I have a csv file like:
>>> uid      mid      features       label
>>> 123    5231    [0, 1, 3, ...]    True
>>>
>>> Both the "features" and "label" columns are used for GBTClassifier.
>>>
>>> However, when I read the file:
>>> Dataset<Row> samples = sparkSession.read().csv(file);
>>> The type of samples.select("features") is String.
>>>
>>> My question is:
>>> How can I map samples.select("features") to a Vector or another
>>> appropriate type, so I can use it for training like this:
>>>         GBTClassifier gbdt = new GBTClassifier()
>>>                 .setLabelCol("label")
>>>                 .setFeaturesCol("features")
>>>                 .setMaxIter(2)
>>>                 .setMaxDepth(7);
>>>
>>> Thanks.
>>>
>>
>>
>
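
Since the original question was asked in Java, the per-value transformation
from the quoted Scala session can be sketched in plain Java as well. This is
only a minimal sketch: FeatureParser and parseFeatures are hypothetical names,
and it assumes every cell is a complete bracketed, comma-separated list of
numbers (an elided cell such as "[0, 1, 3, ...]" would not parse). It returns
double[] rather than Array[Int] because that is the shape
org.apache.spark.ml.linalg.Vectors.dense accepts; wiring the helper into a
Spark UDF and Dataset.withColumn is not shown here.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Spark: turns "[1, 2, 3, 4, 5]" into a double[].
final class FeatureParser {

    static double[] parseFeatures(String raw) {
        String body = raw.trim();
        // Drop the surrounding brackets, like slice(1, length - 1) in the Scala version.
        body = body.substring(1, body.length() - 1).trim();
        if (body.isEmpty()) {
            return new double[0];
        }
        // Split on commas, trim whitespace around each token, and parse it as a number.
        return Arrays.stream(body.split(","))
                .map(String::trim)
                .mapToDouble(Double::parseDouble)
                .toArray();
    }

    public static void main(String[] args) {
        // prints [1.0, 2.0, 3.0, 4.0, 5.0]
        System.out.println(Arrays.toString(parseFeatures("[1, 2, 3, 4, 5]")));
    }
}
```

Splitting on the comma and trimming each token, instead of deleting commas and
splitting on a single space, keeps the parse tolerant of uneven spacing between
values.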
