Hi,
I am writing a logistic regression program with Spark, based on the SparkLR example.
Say I have a data set of 10000 DataPoints, where DataPoint is a case
class, case class DataPoint(x: Vector, y: Double), as defined in the
SparkLR example.
To divide the data set into two parts, a training set and a test set, I
tried the code below:
val trainingSet = points.sample(false, 0.6, 7)
val testSet = points.subtract(trainingSet)
where points is an RDD[DataPoint] containing 10000 points.
sample works well: trainingSet.count gives a number around 6000, but
testSet.count gives 10000 instead of the expected 4000.
It seems that subtract can't work with a custom class such as DataPoint.
Two questions:
1) What is the best way to divide data by a ratio, say 6/4, especially
when the data is not a primitive type but a custom class?
2) Why doesn't subtract work? Should ordering and comparison be
implemented for the DataPoint class?
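
To make question 1 concrete, here is a minimal sketch of one way to split by ratio without comparing elements at all: draw one seeded random number per element and send the element to exactly one side. It uses plain Scala collections rather than an RDD so that it runs on its own. SplitSketch, split and the Array-based DataPoint are hypothetical names for illustration, not part of SparkLR. As far as I know, later Spark releases also provide RDD.randomSplit for this purpose.

```scala
import scala.util.Random

// Hypothetical stand-in for the SparkLR DataPoint; x is a plain Array here.
case class DataPoint(x: Array[Double], y: Double)

object SplitSketch {
  // Each element gets a single seeded draw and lands on exactly one side,
  // so the two parts are disjoint and no equality check is ever needed.
  def split[T](data: Seq[T], trainFraction: Double, seed: Long): (Seq[T], Seq[T]) = {
    val rng = new Random(seed)
    val tagged = data.map(d => (d, rng.nextDouble() < trainFraction))
    val (train, test) = tagged.partition(_._2)
    (train.map(_._1), test.map(_._1))
  }

  def main(args: Array[String]): Unit = {
    val rng = new Random(42)
    // Synthetic points, only to have something to split.
    val points = Seq.fill(10000)(
      DataPoint(Array.fill(3)(rng.nextGaussian()), if (rng.nextBoolean()) 1.0 else -1.0)
    )
    val (trainingSet, testSet) = split(points, 0.6, 7L)
    // The two sizes always add up to points.size; the ratio is only approximate.
    println(s"training: ${trainingSet.size}, test: ${testSet.size}")
  }
}
```

The point of this design is that the test set is defined as "everything not drawn into the training set" at the moment of the draw, instead of being computed afterwards with a set difference.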
I have also looked at the SubtractedRDD class, but without background in the
Spark source code I cannot see what the problem is.
https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala
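
One possible explanation for question 2, offered as an assumption rather than a confirmed diagnosis: a set difference has to rely on equals and hashCode, and a case class delegates both to its fields. If the vector type used for x does not override them, two copies of the same point stop being equal once they have been serialized and deserialized, which is roughly what happens to records that cross a shuffle. The sketch below shows this effect in plain Scala. PlainVector, ValuePoint and EqualitySketch are hypothetical names, and PlainVector only stands in for a vector class without value-based equality.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical vector class: with no overrides, it inherits
// reference-based equals and hashCode from AnyRef.
class PlainVector(val elements: Array[Double]) extends java.io.Serializable

// Case class equality delegates to the equality of each field.
case class DataPoint(x: PlainVector, y: Double)

// Variant whose field (an immutable Seq) has value-based equality.
case class ValuePoint(x: Seq[Double], y: Double)

object EqualitySketch {
  // Round-trip a value through Java serialization, producing a fresh copy.
  def roundTrip[T <: java.io.Serializable](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val p = DataPoint(new PlainVector(Array(1.0, 2.0)), 1.0)
    val q = ValuePoint(Seq(1.0, 2.0), 1.0)
    println(roundTrip(p) == p) // false: the copied PlainVector is a different reference
    println(roundTrip(q) == q) // true: Seq compares element by element
  }
}
```

If this is indeed the cause, ordering would not be needed; giving the vector type (or DataPoint itself) value-based equals and hashCode should be enough for a set difference to recognise matching points.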
Any help is highly appreciated!
Thank you in advance. =)
Hao

REN Hao
Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
Computer Science
Student at the Université de Technologie de Compiègne (UTC)
Computer Engineering, Data Mining
Tel: +33 06 14 54 57 24 ／ +41 07 86 47 52 69
