spark-user mailing list archives

From Jason Lenderman <jslender...@gmail.com>
Subject Re: RDD.subtract doesn't work
Date Thu, 12 Sep 2013 16:55:49 GMT
Even if it worked, using subtract doesn't seem like a good way to achieve
this. You could try something like:

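// Tag each element with a random draw, then split on the threshold p.
def split[T : ClassManifest](data: RDD[T], p: Double,
    seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  val rand = new java.util.Random(seed)
  val temp = data.map(x => (x, rand.nextDouble))
  // Elements with a draw <= p go to the first RDD, the rest to the second.
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}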

Note: this code compiles, but I haven't tested it yet...
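
One caveat worth checking before relying on the snippet above: the single Random is captured by the closure, so each task most likely works on its own deserialized copy that starts from the same state, and every partition would then see the same sequence of draws. The sketch below shows one possible way around that, seeding one generator per partition from the seed plus the partition index. The helper name tagPartition is hypothetical, and the local Seq of Seqs only stands in for an RDD so the logic can be run without a cluster. The commented lines at the end show how it might be wired into an RDD with mapPartitionsWithIndex; that part is an untested assumption.

```scala
import scala.util.Random

// Hypothetical helper (not part of Spark): tags every element of one
// partition with a uniform draw in [0, 1), seeding the generator from
// the user-supplied seed plus the partition index.
def tagPartition[T](seed: Long)(index: Int, iter: Iterator[T]): Iterator[(T, Double)] = {
  val rand = new Random(seed + index)
  iter.map(x => (x, rand.nextDouble()))
}

// Local stand-in for an RDD with two partitions.
val partitions = Seq(Seq(1, 2, 3), Seq(4, 5, 6))
val tagged = partitions.zipWithIndex.flatMap { case (part, idx) =>
  tagPartition[Int](42L)(idx, part.iterator).toList
}

// Split on the threshold p, as in the snippet above.
val p = 0.6
val (left, right) = tagged.partition(_._2 <= p)

// With an RDD, the same helper could be used roughly like this (assumption):
//   val temp = data.mapPartitionsWithIndex(tagPartition[T](seed) _).cache()
//   (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
```

Caching temp in the commented RDD version is a design choice: both filters then read the same tagged data instead of recomputing the draws twice.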

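On the second question in the quoted message below (why subtract returned all 10000 points): a likely explanation, offered as an assumption rather than a confirmed diagnosis, is that subtract matches elements through hashCode and equals, not through an ordering. A case class compares its fields with ==, so if the Vector type inside DataPoint does not override equals and hashCode, two copies of the same point that went through serialization are no longer equal. The sketch below reproduces that effect in plain Scala. PlainVector, ValueVector, PlainPoint and ValuePoint are hypothetical stand-ins, not the classes from the SparkLR example.

```scala
// Hypothetical stand-in for a vector class that does not override
// equals/hashCode (an assumption about the Vector used in SparkLR).
class PlainVector(val elements: Array[Double]) extends Serializable

// Same data, but with content-based equals and hashCode.
class ValueVector(val elements: Array[Double]) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case that: ValueVector => java.util.Arrays.equals(elements, that.elements)
    case _ => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(elements)
}

case class PlainPoint(x: PlainVector, y: Double)
case class ValuePoint(x: ValueVector, y: Double)

// Two separately constructed copies of the same point, similar to what
// a round trip through serialization would produce.
val a = PlainPoint(new PlainVector(Array(1.0, 2.0)), 1.0)
val b = PlainPoint(new PlainVector(Array(1.0, 2.0)), 1.0)
val c = ValuePoint(new ValueVector(Array(1.0, 2.0)), 1.0)
val d = ValuePoint(new ValueVector(Array(1.0, 2.0)), 1.0)

println(a == b) // false: x falls back to reference equality
println(c == d) // true: x is compared by content
```

If this is indeed the cause, giving the vector type content-based equals and hashCode should be enough. Implementing an ordering for DataPoint would not be needed.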

On Thu, Sep 12, 2013 at 1:18 AM, Hao REN <julien19890118@gmail.com> wrote:

> Hi,
>
> I am writing a logistic regression program with Spark, based on the SparkLR
> example.
>
> Say I have a data set containing 10000 DataPoints, where DataPoint is a case
> class, case class DataPoint(x: Vector, y: Double), as defined in the
> SparkLR example.
>
> To divide the data set into two parts, a training set and a test set, I
> tried the code below:
>
> val trainingSet = points.sample(false, 0.6, 7)
> val testSet = points.subtract(trainingSet)
>
> where points is an RDD[DataPoint] containing 10000 points.
>
> sample works well: trainingSet.count gives a number around 6000. But
> testSet.count gives 10000, which is not the expected 4000.
>
> It seems that subtract can't work with a custom class such as DataPoint
> here.
>
> Two questions:
>
> 1) What is the best way to divide data by a ratio, say 6/4, especially
> when the data is not a primitive type but a custom class?
>
> 2) Why doesn't subtract work? Maybe ordering and comparison should be
> implemented for the DataPoint class?
>
>
> I have also checked the SubtractedRDD class. Without a background in the
> Spark source code, I cannot work out what the problem is.
>
> https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala
>
>
> Any help is highly appreciated!
>
> Thank you in advance. =)
>
> Hao
>
>
> --
> REN Hao
>
> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
>
> Computer Science
>
> Student at the Université de Technologie de Compiègne (UTC)
>
> Computer Engineering - Data Mining
>
> Tel:  +33 06 14 54 57 24  /  +41 07 86 47 52 69
>
