spark-user mailing list archives

From Sean Owen
Subject Re: Does the kFold in Spark always give you the same split?
Date Fri, 30 Jan 2015 19:55:12 GMT
Are you using SGD for logistic regression? There's a random element
there too, by nature. I looked into the code and see that you can't
set a seed, but actually, the sampling is done with a fixed seed per
partition anyway. Hm.
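
To make the "fixed seed per partition" point concrete, here is a minimal sketch in plain Python. It is not the Spark code; `sample_partitions` and `split` are hypothetical helpers, and seeding each partition with `seed + index` is an assumption made purely for illustration. It shows that such sampling repeats exactly as long as the data is partitioned the same way, while the sample drawn depends on the partition layout.

```python
import random

def sample_partitions(partitions, fraction, seed):
    """Sample each partition with its own RNG, seeded by a fixed seed plus the partition index."""
    sampled = []
    for index, partition in enumerate(partitions):
        rng = random.Random(seed + index)  # assumption: per-partition seed = seed + index
        sampled.append([x for x in partition if rng.random() < fraction])
    return sampled

def split(data, num_partitions):
    """Split data into contiguous chunks (hypothetical stand-in for partitioning)."""
    size = -(-len(data) // num_partitions)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(20))
run1 = sample_partitions(split(data, 2), 0.5, seed=42)
run2 = sample_partitions(split(data, 2), 0.5, seed=42)
run3 = sample_partitions(split(data, 4), 0.5, seed=42)

print(run1 == run2)  # same seed, same partitioning: identical samples
print(run1)
print(run3)          # same seed, different partitioning: compare by eye
```

With the same partitioning the two runs match exactly; whether a different partitioning selects different elements is left for the reader to inspect in the printed output.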

In general you would not expect these algorithms to produce the same
result, given the stochastic nature. In this particular case, I'm not
sure if you can or should be able to get the implementation to act
deterministically. Even if the overt use of randomness is seed-able,
there may be some non-determinism in the distributed nature of the
processing that is having an effect.
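
One plausible source of that kind of non-determinism, offered as an assumption rather than a diagnosis of this case: floating-point addition is not associative, so merging the same partial results in a different order can change the last bits of the total. The sketch below is plain Python with made-up partial values, and `combine` is a hypothetical helper, not a Spark function.

```python
from functools import reduce
import operator

def combine(partials):
    """Fold partial results left to right, one after another."""
    return reduce(operator.add, partials)

# Hypothetical per-partition partial sums (illustrative values only).
partials = [0.1, 0.2, 0.3]

in_order = combine(partials)              # (0.1 + 0.2) + 0.3
reversed_order = combine(partials[::-1])  # (0.3 + 0.2) + 0.1

print(in_order)                    # 0.6000000000000001
print(reversed_order)              # 0.6
print(in_order == reversed_order)  # False
```

The difference is tiny, but an iterative optimizer can amplify it over many steps. On a single machine with a fixed order of evaluation the result would repeat, which would be consistent with the local runs being reproducible.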

On Fri, Jan 30, 2015 at 7:27 PM, Jianguo Li wrote:
> Thanks. I did specify a seed parameter.
> It seems the problem is not caused by kFold. I ran another
> experiment without cross-validation: I just built a model on the training
> data and then tested it on the test data. However, the accuracy still
> varies from one run to another. Interestingly, this only happens when I run
> the experiment on our cluster. If I run the experiment on my local machine,
> I can reproduce the result each time. Has anybody encountered a similar
> issue before?
> Thanks,
> Jianguo
> On Fri, Jan 30, 2015 at 11:22 AM, Sean Owen wrote:
>> Have a look at the source code for MLUtils.kFold. Yes, there is a
>> random element. That's good; you want the folds to be randomly chosen.
>> Note there is a seed parameter, as in a lot of the APIs, that lets you
>> fix the RNG seed and so get the same result every time, if you need
>> to.
>> On Fri, Jan 30, 2015 at 4:12 PM, Jianguo Li wrote:
>> > Hi,
>> >
>> > I am using the utility function kFold provided in Spark to do k-fold
>> > cross-validation with logistic regression. However, each time I run the
>> > experiment, I get a different result. Since everything else stays
>> > constant, I was wondering if this is due to the kFold function I used.
>> > Does anyone know if kFold gives you a different split of a data set each
>> > time you call it?
>> >
>> > Thanks,
>> >
>> > Jianguo
