spark-user mailing list archives

From Mark Hamstra <m...@clearstorydata.com>
Subject Re: RDD.subtract doesn't work
Date Thu, 12 Sep 2013 17:20:05 GMT
That's not really the best way to handle random number generation.  There
have been multiple discussions on
https://groups.google.com/forum/?fromgroups=#!forum/spark-users and
elsewhere about how to use mapPartitions or mapWith to create
higher-performance Spark code that uses PRNGs.
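
A minimal sketch of the per-partition idea, not taken from those
discussions: create one PRNG per partition instead of capturing a single
java.util.Random in a closure applied to every element. The names
PartitionRng and tagPartition and the seed-mixing constant are hypothetical,
chosen only for illustration, and the seed mixing is a simple heuristic
rather than a rigorous way to obtain independent streams.

```scala
import java.util.Random

object PartitionRng {
  // Hypothetical helper, not part of Spark. Pairs every element of one
  // partition with a uniform draw in [0, 1) from a PRNG that is created
  // once per partition rather than shared through a closure.
  def tagPartition[T](seed: Long, index: Int, iter: Iterator[T]): Iterator[(T, Double)] = {
    // Assumption: multiplying the partition index by a large odd constant
    // before XOR-ing spreads the per-partition seeds apart, so partitions
    // do not replay the same (or a nearly identical) sequence.
    val mixed = seed ^ (index.toLong * 0x5851F42D4C957F2DL)
    val rand = new Random(mixed)
    // The iterator stays lazy: one draw is made per element as it is consumed.
    iter.map(x => (x, rand.nextDouble()))
  }
}
```

With an RDD this function could be passed to mapPartitionsWithIndex, for
example data.mapPartitionsWithIndex((i, it) => PartitionRng.tagPartition(seed, i, it)),
and the result split with the same two filters as in the quoted code below.
Caching the tagged RDD avoids recomputing it for each filter.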


On Thu, Sep 12, 2013 at 9:55 AM, Jason Lenderman <jslenderman@gmail.com> wrote:

> Even if it worked, using subtract doesn't seem like a good way to achieve
> this. You could try something like:
>
> def split[T : ClassManifest](data: RDD[T], p: Double, seed: Long =
> System.currentTimeMillis): (RDD[T], RDD[T]) = {
>   val rand = new java.util.Random(seed)
>   val temp = data.map(x => (x, rand.nextDouble))
>   (temp.filter(_._2 <= p).map(_._1), temp.filter(_._2 > p).map(_._1))
> }
>
> Note: this code compiles, but I haven't tested it yet...
>
>
> On Thu, Sep 12, 2013 at 1:18 AM, Hao REN <julien19890118@gmail.com> wrote:
>
>> Hi,
>>
>> I am writing a logistic regression program with Spark, based on the
>> SparkLR example.
>>
>> Say I have a data set containing 10000 DataPoints, where DataPoint is a
>> case class like:  case class DataPoint(x: Vector, y: Double), as defined
>> in the SparkLR example.
>>
>> To divide the data set into 2 parts, a training set and a test set,
>> I tried the code below:
>>
>> val trainingSet = points.sample(false, 0.6, 7)
>> val testSet = points.subtract(trainingSet)
>>
>> where points is an RDD[DataPoint] containing 10000 points.
>>
>> sample works well: trainingSet.count gives a number around 6000. But
>> testSet.count gives 10000, not the expected 4000 or so.
>>
>> It seems that subtract can't work with some custom classes, such as
>> DataPoint here.
>>
>> 2 questions:
>>
>> 1) What is the best way to divide data by a ratio, say 6/4, especially
>> when the data is not a primitive type but some custom class?
>>
>> 2) Why doesn't subtract work? Maybe ordering and comparison should be
>> implemented for the DataPoint class?
>>
>>
>> I have also checked the SubtractedRDD class. Without any background in the
>> Spark source code, I cannot understand what the problem is.
>>
>> https://github.com/mesos/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/SubtractedRDD.scala
>>
>>
>> Any help is highly appreciated!
>>
>> Thank you in advance. =)
>>
>> Hao
>>
>>
>> --
>> REN Hao
>>
>> Exchange student at the Ecole Polytechnique Fédérale de Lausanne (EPFL)
>>
>> Computer Science
>>
>> Student at the Université de Technologie de Compiègne (UTC)
>>
>> Computer Engineering - Data Mining
>>
>> Tel:  +33 06 14 54 57 24  /  +41 07 86 47 52 69
>>
>
>
