Even if it worked, subtract would be an expensive way to achieve this,
since it has to shuffle both RDDs. You could instead tag each element with
a random number and filter on it, something like:
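import org.apache.spark.rdd.RDD
def split[T : ClassManifest](data: RDD[T], p: Double,
    seed: Long = System.currentTimeMillis): (RDD[T], RDD[T]) = {
  // Seed one generator per partition; a single shared Random would be
  // copied into every task and repeat the same draws in each partition.
  val temp = data.mapPartitionsWithIndex { (i, it) =>
    val rand = new java.util.Random(seed + i)
    it.map(x => (x, rand.nextDouble))
  }.cache() // reused by both filters
  (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
}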
Note: this code compiles, but I haven't tested it yet...
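
As for why subtract returns all 10000 points: I have not traced this through
SubtractedRDD, so treat it as a guess. As far as I know, subtract matches
elements by hashCode and equals. A case class only compares equal if its
fields do, so if the Vector inside DataPoint does not override equals and
hashCode, two points with identical values are still different keys once they
have been serialized and read back. The sketch below shows the effect in plain
Scala, without Spark. Vec, ContentVec and ContentPoint are made-up stand-ins
for illustration, not the real SparkLR or Spark classes.

```scala
// Hypothetical stand-in for a vector class WITHOUT equals/hashCode:
// instances compare by reference only.
class Vec(val elements: Array[Double]) extends Serializable

case class DataPoint(x: Vec, y: Double)

// Hypothetical variant that compares by content instead.
class ContentVec(val elements: Array[Double]) extends Serializable {
  override def equals(other: Any): Boolean = other match {
    case v: ContentVec => java.util.Arrays.equals(elements, v.elements)
    case _             => false
  }
  override def hashCode: Int = java.util.Arrays.hashCode(elements)
}

case class ContentPoint(x: ContentVec, y: Double)

object EqualityDemo {
  def main(args: Array[String]): Unit = {
    // Same values, distinct Vec instances, as after a serialization round trip.
    val a = DataPoint(new Vec(Array(1.0, 2.0)), 1.0)
    val b = DataPoint(new Vec(Array(1.0, 2.0)), 1.0)
    println(a == b) // false: Vec falls back to reference equality

    val c = ContentPoint(new ContentVec(Array(1.0, 2.0)), 1.0)
    val d = ContentPoint(new ContentVec(Array(1.0, 2.0)), 1.0)
    println(c == d)                   // true
    println(c.hashCode == d.hashCode) // true
  }
}
```

If that is what is happening, giving the vector class content-based equals and
hashCode, as ContentVec does, should let subtract find the matching points.
The random-tag split above avoids the question entirely.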
On Thu, Sep 12, 2013 at 1:18 AM, Hao REN <julien19890118@gmail.com> wrote:
> Hi,
>
> I am writing a logistic regression program with Spark, based on the SparkLR
> example.
>
> Say I have a data set containing 10000 DataPoints, where DataPoint is the
> case class defined in the SparkLR example:
> case class DataPoint(x: Vector, y: Double)
>
> To divide the data set into two parts, a training set and a test set, I
> tried the code below:
>
> val trainingSet = points.sample(false, 0.6, 7)
> val testSet = points.subtract(trainingSet)
>
> where points is an RDD[DataPoint] containing 10000 points.
>
> sample works well: trainingSet.count gives a number around 6000. But
> testSet.count gives 10000, not the expected 4000.
>
> It seems that subtract can't work with some custom classes, such as
> DataPoint here.
>
> 2 questions:
>
> 1) What is the best way to divide data by a ratio, say 6/4, especially
> when the data is not a primitive type but a custom class?
>
> 2) Why doesn't subtract work? Maybe ordering and comparison should be
> implemented for the DataPoint class?
>
>
> I have also checked the SubtractedRDD class, but without background in the
> Spark source code I cannot tell what the problem is.
>
> https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala
>
>
> Any help is highly appreciated!
>
> Thank you in advance. =)
>
> Hao
>
> REN Hao
> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL),
> Computer Science
> Student at the Université de Technologie de Compiègne (UTC),
> Computer Engineering, Data Mining
> Tel: +33 06 14 54 57 24 / +41 07 86 47 52 69
