spark-issues mailing list archives

From Rémi Delassus (JIRA) <>
Subject [jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
Date Fri, 28 Oct 2016 07:15:59 GMT


Rémi Delassus commented on SPARK-17055:

I had an issue that could be solved by this kind of technique: each sample has a timestamp,
and the error can only be computed on a full month of data. Thus I need whole months of data
to be distributed across folds, not each sample individually.

But in my opinion this is not the right way to solve it. Since there are infinitely many
ways to split the data, I think we should be able to pass the split method as an argument
to the CrossValidator. The method described here could then be implemented, as well as any other.
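
To make the suggestion concrete, here is a minimal sketch in plain Python of a cross-validation loop that takes the split method as an argument. It is not Spark's CrossValidator API; the names `Splitter`, `month_splitter`, `cross_validate`, `timestamp_of` and `fit_and_score` are all hypothetical and chosen only for illustration. The example splitter keeps every calendar month inside a single fold, which is the case described above.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Iterator, List, Sequence, Tuple

# A splitter maps the samples to (train_indices, test_indices) pairs.
Split = Tuple[List[int], List[int]]
Splitter = Callable[[Sequence], Iterator[Split]]


def month_splitter(num_folds: int, timestamp_of: Callable) -> Splitter:
    """Hypothetical splitter: all samples of one month land in the same fold."""
    def split(samples: Sequence) -> Iterator[Split]:
        # Group sample indices by (year, month).
        by_month = defaultdict(list)
        for i, sample in enumerate(samples):
            ts = timestamp_of(sample)
            by_month[(ts.year, ts.month)].append(i)
        months = sorted(by_month)
        # Months are dealt to folds round-robin; each fold is the test set once.
        for fold in range(num_folds):
            train, test = [], []
            for k, month in enumerate(months):
                (test if k % num_folds == fold else train).extend(by_month[month])
            yield train, test
    return split


def cross_validate(samples: Sequence, splitter: Splitter, fit_and_score: Callable) -> float:
    """Average the score over whatever folds the given splitter yields."""
    scores = [
        fit_and_score([samples[i] for i in train], [samples[i] for i in test])
        for train, test in splitter(samples)
    ]
    return sum(scores) / len(scores)


# Usage with made-up data: (timestamp, value) pairs spread over four months.
samples = [(date(2016, m, d), float(m)) for m in (1, 2, 3, 4) for d in (1, 15)]
splitter = month_splitter(2, timestamp_of=lambda s: s[0])
# The scoring function here is a placeholder that just reports the test-set size.
mean_test_size = cross_validate(samples, splitter, lambda train, test: float(len(test)))
```

With this shape, random k-fold, label-based k-fold or any other scheme would be just another function returning index pairs, and the validation loop itself would not need to change.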

> add labelKFold to CrossValidator
> --------------------------------
>                 Key: SPARK-17055
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Vincent
>            Priority: Minor
> Current CrossValidator only supports k-fold, which randomly divides all the samples into
> k groups of samples. But in cases when data is gathered from different subjects and we want
> to avoid over-fitting, we want to hold out samples with certain labels from training data
> and put them into validation fold, i.e. we want to ensure that the same label is not in both
> testing and training sets.
> Mainstream packages like scikit-learn already support such cross-validation methods. (
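
The fold assignment described in the issue can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function name `label_k_fold` is hypothetical, the greedy size balancing is one possible choice rather than the method of Spark or any particular library, and "label" here means the subject or group identifier, not the prediction target.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def label_k_fold(labels: Sequence[Hashable], num_folds: int) -> List[int]:
    """Return one fold index per sample so that equal labels always share a fold."""
    counts = Counter(labels)
    if num_folds > len(counts):
        raise ValueError("need at least as many distinct labels as folds")
    fold_sizes = [0] * num_folds
    fold_of_label = {}
    # Greedy balancing (an assumption of this sketch): place the largest label
    # groups first, each into the fold that currently holds the fewest samples.
    for label, size in counts.most_common():
        target = fold_sizes.index(min(fold_sizes))
        fold_of_label[label] = target
        fold_sizes[target] += size
    return [fold_of_label[label] for label in labels]


# Usage with made-up subject identifiers.
subjects = ["a", "a", "b", "c", "c", "c", "d"]
fold_ids = label_k_fold(subjects, num_folds=2)
```

Using fold `f` as the validation set and the remaining folds as training data then guarantees that no subject appears on both sides of the split.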

This message was sent by Atlassian JIRA
