spark-user mailing list archives

From 张建鑫(市场部) <zhangjian...@didichuxing.com>
Subject Re: LabeledPoint creation
Date Thu, 08 Sep 2016 12:05:12 GMT
Hi,
Below is what I typed in my Scala command line, based on your first email. The result
is different from yours. Just for your reference.
My Spark version is 1.6.1.

import org.apache.spark.ml.feature._
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint

val df=    sqlContext.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d"))).toDF("id", "category")

    val indexer = new StringIndexer().setInputCol("category").setOutputCol("categoryIndex").fit(df)

    val indexed = indexer.transform(df)

    indexed.select("category", "categoryIndex").show()

    val encoder = new OneHotEncoder().setInputCol("categoryIndex").setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)

     encoded.select("id", "category", "categoryVec").show()
// Convert each encoded row into a LabeledPoint: id as the label, one-hot vector as the features.
val data = encoded.rdd.map { x =>
  val featureVector = Vectors.dense(x.getAs[org.apache.spark.mllib.linalg.SparseVector]("categoryVec").toArray)
  val label = x.getAs[java.lang.Integer]("id").toDouble
  LabeledPoint(label, featureVector)
}
val result = sqlContext.createDataFrame(data)

scala> result.show()
+-----+-------------+
|label|     features|
+-----+-------------+
|  0.0|[1.0,0.0,0.0]|
|  1.0|[0.0,0.0,1.0]|
|  2.0|[0.0,1.0,0.0]|
|  3.0|[1.0,0.0,0.0]|
|  4.0|[1.0,0.0,0.0]|
|  5.0|[0.0,1.0,0.0]|
|  6.0|[0.0,0.0,0.0]|
+-----+-------------+
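
For reference, the three-element vectors above follow from the `dropLast` parameter of `OneHotEncoder`, which defaults to `true` and leaves the last category index as an all-zero vector. The following is a minimal sketch of keeping all four columns with `setDropLast(false)`. It assumes the Spark 2.x API used elsewhere in this thread (`SparkSession`, `OneHotEncoder` as a plain transformer, and `org.apache.spark.ml.feature.LabeledPoint`); the object and method names are hypothetical, and on 1.6.x `sqlContext` and the mllib `LabeledPoint` would be needed instead.

```scala
import org.apache.spark.ml.feature.{LabeledPoint, OneHotEncoder, StringIndexer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object FourFeatureLabeledPoints {

  def labeledPoints(spark: SparkSession): Array[LabeledPoint] = {
    val df = spark.createDataFrame(Seq(
      (0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"), (6, "d")
    )).toDF("id", "category")

    // Map each category string to a numeric index (most frequent label gets index 0).
    val indexed = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)
      .transform(df)

    // dropLast = false keeps one column per category, so 4 categories give 4 features.
    val encoded = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
      .setDropLast(false)
      .transform(indexed)

    // Use id as the label and the one-hot vector as the features.
    encoded.rdd.map { row =>
      LabeledPoint(row.getAs[Int]("id").toDouble, row.getAs[Vector]("categoryVec"))
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy dataset.
    val spark = SparkSession.builder().master("local[*]").appName("one-hot-sketch").getOrCreate()
    labeledPoints(spark).foreach(println)
    spark.stop()
  }
}
```

Which category lands on which column depends on the index that `StringIndexer` assigns, so the column order may differ between runs or versions when labels have equal counts (as `b` and `d` do here).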
From: Madabhattula Rajesh Kumar <mrajaforu@gmail.com>
Date: Thursday, September 8, 2016, 2:10 PM
To: "aka.fe2s" <aka.fe2s@gmail.com>
Cc: "user@spark.apache.org" <user@spark.apache.org>
Subject: Re: LabeledPoint creation

Hi,

I have done this in a different way. Please correct me: is this approach right?

val df = spark.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d"))).toDF("id", "category")

    val categories: List[String] = List("a", "b", "c", "d")
    val labelPoint = df.rdd.map { line =>
      val value = line.getAs[String]("category")
      val id = line.getAs[java.lang.Integer]("id").toDouble
      // Build a fresh array for every row; a single mutable array shared by the
      // closure would be reused, so rows kept in memory could end up with the same features.
      val features = categories.map(c => if (c == value) 1.0 else 0.0).toArray
      val denseVector = Vectors.dense(features)
      LabeledPoint(id, denseVector)
    }
    labelPoint.foreach { x => println(x) }

Output :-

(0.0,[1.0,0.0,0.0,0.0])
(1.0,[0.0,1.0,0.0,0.0])
(2.0,[0.0,0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0,0.0])
(4.0,[1.0,0.0,0.0,0.0])
(5.0,[0.0,0.0,1.0,0.0])
(6.0,[0.0,0.0,0.0,1.0])

Regards,
Rajesh


On Thu, Sep 8, 2016 at 12:35 AM, aka.fe2s <aka.fe2s@gmail.com> wrote:
It has 4 categories
a = 1 0 0
b = 0 0 0
c = 0 1 0
d = 0 0 1
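
The mapping above can be reproduced without Spark. The following is a rough plain-Scala sketch of the idea, not Spark's implementation: indices are assigned by descending frequency, and with the last index dropped one category becomes the all-zero vector. The object and method names are hypothetical, and the alphabetical tie-break between `b` and `d` is an assumption made only to keep the sketch deterministic, so the category that ends up all-zero here may differ from the table above.

```scala
// Hypothetical helper object illustrating one-hot encoding with the last index dropped.
object DropLastSketch {

  // Assign indices by descending frequency; ties are broken alphabetically (an assumption).
  def index(values: Seq[String]): Map[String, Int] =
    values.groupBy(identity).toSeq
      .sortBy { case (value, occurrences) => (-occurrences.size, value) }
      .map(_._1)
      .zipWithIndex
      .toMap

  // One-hot encode a single index. With dropLast = true the vector has one slot fewer,
  // so the highest index is encoded as all zeros.
  def encode(idx: Int, numCategories: Int, dropLast: Boolean): Vector[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    Vector.tabulate(size)(i => if (i == idx) 1.0 else 0.0)
  }

  def main(args: Array[String]): Unit = {
    val categories = Seq("a", "b", "c", "a", "a", "c", "d")
    val idx = index(categories)
    // Print each category with its 3-element encoding, in index order.
    idx.toSeq.sortBy(_._2).foreach { case (category, i) =>
      println(s"$category = ${encode(i, idx.size, dropLast = true).mkString(" ")}")
    }
  }
}
```

Four categories are still distinguishable in three dimensions because the all-zero vector is itself a valid code; calling `encode` with `dropLast = false` gives the four-element form instead.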

--
Oleksiy Dyagilev

On Wed, Sep 7, 2016 at 10:42 AM, Madabhattula Rajesh Kumar <mrajaforu@gmail.com> wrote:
Hi,

Any help with the use case from my earlier mail?

Regards,
Rajesh

On Tue, Sep 6, 2016 at 5:40 PM, Madabhattula Rajesh Kumar <mrajaforu@gmail.com> wrote:
Hi,

I am new to Spark ML and am trying to create a LabeledPoint from a categorical dataset (example code
from Spark). For this, I am using the one-hot encoding <http://en.wikipedia.org/wiki/One-hot>
feature. Below is my code:

val df = sparkSession.createDataFrame(Seq(
      (0, "a"),
      (1, "b"),
      (2, "c"),
      (3, "a"),
      (4, "a"),
      (5, "c"),
      (6, "d"))).toDF("id", "category")

    val indexer = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .fit(df)

    val indexed = indexer.transform(df)

    indexed.select("category", "categoryIndex").show()

    val encoder = new OneHotEncoder()
      .setInputCol("categoryIndex")
      .setOutputCol("categoryVec")
    val encoded = encoder.transform(indexed)

     encoded.select("id", "category", "categoryVec").show()

Output :-
+---+--------+-------------+
| id|category|  categoryVec|
+---+--------+-------------+
|  0|       a|(3,[0],[1.0])|
|  1|       b|    (3,[],[])|
|  2|       c|(3,[1],[1.0])|
|  3|       a|(3,[0],[1.0])|
|  4|       a|(3,[0],[1.0])|
|  5|       c|(3,[1],[1.0])|
|  6|       d|(3,[2],[1.0])|
+---+--------+-------------+

Creating a LabeledPoint from the encoded DataFrame:

val data = encoded.rdd.map { x =>
      {
        val featureVector = Vectors.dense(x.getAs[org.apache.spark.ml.linalg.SparseVector]("categoryVec").toArray)
        val label = x.getAs[java.lang.Integer]("id").toDouble
        LabeledPoint(label, featureVector)
      }
    }

    data.foreach { x => println(x) }

Output :-

(0.0,[1.0,0.0,0.0])
(1.0,[0.0,0.0,0.0])
(2.0,[0.0,1.0,0.0])
(3.0,[1.0,0.0,0.0])
(4.0,[1.0,0.0,0.0])
(5.0,[0.0,1.0,0.0])
(6.0,[0.0,0.0,1.0])

I have four categorical values: a, b, c, and d. I am expecting 4 features in the above LabeledPoint,
but it has only 3 features.

Please help me create a LabeledPoint from categorical features.

Regards,
Rajesh




