spark-user mailing list archives

From Nan Zhu <zhunanmcg...@gmail.com>
Subject Re: MLUtil.kfold generates overlapped training and validation set?
Date Fri, 10 Oct 2014 14:57:17 GMT
Thanks, Xiangrui,   

I found the reason for the overlapping training and test sets.

….

Another counter-intuitive issue related to https://github.com/apache/spark/pull/2508

Best,  

--  
Nan Zhu


On Friday, October 10, 2014 at 2:19 AM, Xiangrui Meng wrote:

> 1. No.
>  
> 2. The seed per partition is fixed. So it should generate
> non-overlapping subsets.
>  
> 3. There was a bug in 1.0, which was fixed in 1.0.1 and 1.1.
>  
> Best,
> Xiangrui
>  
> On Thu, Oct 9, 2014 at 11:05 AM, Nan Zhu <zhunanmcgill@gmail.com> wrote:
> > Hi, all
> >  
> > When we use MLUtils.kFold to generate training and validation sets for
> > cross-validation, we found that the two sets overlap.
> >  
> > From the code, it samples the same dataset twice:
> >  
> > @Experimental
> > def kFold[T: ClassTag](rdd: RDD[T], numFolds: Int, seed: Int): Array[(RDD[T], RDD[T])] = {
> >   val numFoldsF = numFolds.toFloat
> >   (1 to numFolds).map { fold =>
> >     val sampler = new BernoulliSampler[T]((fold - 1) / numFoldsF, fold / numFoldsF,
> >       complement = false)
> >     val validation = new PartitionwiseSampledRDD(rdd, sampler, true, seed)
> >     val training = new PartitionwiseSampledRDD(rdd, sampler.cloneComplement(), true, seed)
> >     (training, validation)
> >   }.toArray
> > }
> >  
> > Even though the second sampler is the complement of the first, it still
> > seems possible to generate overlapping training and validation sets,
> > because the sampling method looks like this:
> >  
> > override def sample(items: Iterator[T]): Iterator[T] = {
> >   items.filter { item =>
> >     val x = rng.nextDouble()
> >     (x >= lb && x < ub) ^ complement
> >   }
> > }
> >  
> > I’m not a machine learning person, so I guess I must be in one of the
> > following three situations:
> >  
> > 1. Does it mean we actually allow overlapping training and validation sets?
> > (counter-intuitive to me)
> >  
> > 2. Did I misunderstand the code?
> >  
> > 3. Is it a bug?
> >  
> > Can anyone explain this to me?
> >  
> > Best,
> >  
> > --
> > Nan Zhu
> >  
>  
>  
>  
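The point made in the thread, that a fixed seed makes the sampler and its complement split the data into non-overlapping subsets, can be illustrated with a minimal sketch. It does not use Spark: `KFoldSketch` and its `sample` helper are hypothetical stand-ins for the quoted `sample` method, and `scala.util.Random` stands in for the per-partition generator. The sketch assumes the items are visited in the same order on both passes, so each item receives the same random draw.

```scala
import scala.util.Random

object KFoldSketch {
  // Hypothetical, simplified version of the quoted sample method: keep an
  // item when its random draw falls in [lb, ub), or outside that range when
  // complement is true. A fresh generator is seeded on every call.
  def sample[T](items: Seq[T], lb: Double, ub: Double,
                complement: Boolean, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val x = rng.nextDouble()
      (x >= lb && x < ub) ^ complement
    }
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 1000).toVector

    // Same seed and same item order: every item gets the same draw in both
    // passes, so it lands in exactly one of the two sets.
    val validation = sample(data, 0.0, 0.25, complement = false, seed = 42L)
    val training = sample(data, 0.0, 0.25, complement = true, seed = 42L)
    assert(validation.intersect(training).isEmpty)
    assert(validation.size + training.size == data.size)

    // Different seed for the complement pass: the draws no longer line up,
    // so nothing guarantees that the two sets are disjoint.
    val otherTraining = sample(data, 0.0, 0.25, complement = true, seed = 43L)
    println(s"shared items with mismatched seeds: ${validation.intersect(otherTraining).size}")
  }
}
```

The disjointness depends entirely on both passes reproducing the same sequence of draws for the same items. If the seed, or the order in which items are visited, differs between the two passes, the sets can overlap.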


