spark-dev mailing list archives

From: Joseph Bradley <>
Subject: Re: Stochastic gradient descent performance
Date: Thu, 02 Apr 2015 17:50:52 GMT
It looks like SPARK-3250 was applied to the sample() call which GradientDescent
uses, and that optimization should kick in for your miniBatchFraction <= 0.4.
Based on your numbers, aggregation seems like the main issue, though I hesitate
to optimize aggregation based on local tests with data sizes that small.
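
For context, the hot path is roughly the sketch below: a minimal, simplified
rendering of what one runMiniBatchSGD iteration does (sample, then
treeAggregate), not the actual MLlib source. computeGradient is a hypothetical
least-squares stand-in for MLlib's Gradient.compute.

    import breeze.linalg.{DenseVector => BDV}
    import org.apache.spark.rdd.RDD

    // Hypothetical stand-in for Gradient.compute: least-squares gradient.
    def computeGradient(x: BDV[Double], label: Double, w: BDV[Double]): BDV[Double] =
      x * ((x dot w) - label)

    // One simplified SGD step: sample a minibatch, then sum per-example
    // gradients with treeAggregate (the two stages you measured).
    def sgdStep(data: RDD[(Double, BDV[Double])], weights: BDV[Double],
                miniBatchFraction: Double, iter: Int): (BDV[Double], Long) =
      data.sample(withReplacement = false, miniBatchFraction, 42 + iter)
        .treeAggregate((BDV.zeros[Double](weights.length), 0L))(
          seqOp = { case ((grad, count), (label, features)) =>
            grad += computeGradient(features, label, weights)  // accumulate locally
            (grad, count + 1L)
          },
          combOp = { case ((g1, c1), (g2, c2)) => (g1 += g2, c1 + c2) })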

The first thing I'd check for is unnecessary object creation; beyond that, I'd
profile in a cluster or with a larger dataset.
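
To get a rough per-stage breakdown before reaching for a profiler, one option
(a sketch under the same assumptions as above, reusing data, weights, and
computeGradient; the time helper is hypothetical, not a Spark API) is to cache
the sampled RDD and force it with count(), so sampling and aggregation are
timed separately. Caching itself perturbs the measurement a bit, so treat the
numbers as indicative only.

    // Hypothetical timing helper.
    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e9}%.3f s")
      result
    }

    val batch = data.sample(withReplacement = false, 0.001, seed = 42).cache()
    time("sample")(batch.count())  // count() forces the sample to materialize
    val gradSum = time("aggregate") {
      batch.treeAggregate(BDV.zeros[Double](weights.length))(
        (g, p) => g += computeGradient(p._2, p._1, weights),
        (g1, g2) => g1 += g2)
    }
    batch.unpersist()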

On Wed, Apr 1, 2015 at 10:09 AM, Ulanov, Alexander <> wrote:

> Sorry for bothering you again, but I think this is an important issue for
> the applicability of SGD in Spark MLlib. Could the Spark developers please
> comment on it?
> -----Original Message-----
> From: Ulanov, Alexander
> Sent: Monday, March 30, 2015 5:00 PM
> To:
> Subject: Stochastic gradient descent performance
> Hi,
> It seems to me that there is overhead in the "runMiniBatchSGD" function of
> MLlib's "GradientDescent". In particular, "sample" and "treeAggregate" can
> take an order of magnitude more time than the actual gradient computation.
> For the mnist dataset of 60K instances with minibatch fraction = 0.001
> (i.e., 60 samples), it takes 0.15 s to sample and 0.3 s to aggregate in
> local mode with 1 data partition on a Core i5 processor, while the actual
> gradient computation takes 0.002 s. I searched through the Spark Jira and
> found a recent update for more efficient sampling (SPARK-3250) that is
> already included in the Spark codebase. Is there a way to reduce the
> sampling and local "treeAggregate" time by an order of magnitude?
> Best regards, Alexander
