spark-user mailing list archives

From Andy Zhang <>
Subject Spark TreeAggregate Slow LogisticRegressionWithSGD
Date Fri, 09 Feb 2018 22:05:39 GMT

I am running logistic regression with SGD (using class
LogisticRegressionWithSGD) on a large libsvm file (Kaggle Criteo
dataset[1]). The file is about 10 GB in size with 40 million training
examples. My code is set up to run minibatches of ~10 examples.
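Concretely, a minibatch of ~10 examples out of 40 million corresponds to a very small miniBatchFraction. Here is a minimal sketch of the kind of call I mean, assuming the standard MLlib API (the file path, iteration count, and exact values are placeholders, not my literal code):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.util.MLUtils

// ~10 examples per minibatch out of ~40 million training examples
val numExamples = 40000000L
val miniBatchFraction = 10.0 / numExamples  // roughly 2.5e-7

// Load the libsvm file and train; the path is a placeholder
val data = MLUtils.loadLibSVMFile(sc, "path/to/criteo_train.libsvm")
val model = LogisticRegressionWithSGD.train(
  data, 100 /* numIterations */, 0.01 /* stepSize */, miniBatchFraction)
```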

When I run the Scala code, the event timeline in the Spark UI shows a lot of
time being spent in treeAggregate within GradientDescent. In fact,
treeAggregate takes about 8 minutes to calculate the loss over a minibatch of
about 10 examples. Why does this take so long, and what can be done to make
it less time consuming? I have attached a screenshot of the event timeline.
Here is a link to the image:

If I try running Spark with 1 master and 1 slave with 8 cores each,
treeAggregate ends up taking 80 seconds. However, with a stepSize of 0.01,
the logistic loss does not converge and instead stays at around 0.67. Since
the weights don't improve past a log loss of 0.67, Spark ends up stopping the
training loop early. Why does the training not converge? Is there anything
wrong with my code?

Here is the code I am using: master/test.scala. The repo also contains the
preprocessing code, the commands to run the script, and the log from a
previous run. I have the dataset on disk, and I am running Spark locally on a
single node with 1 core. I have tried using 1 and 4 partitions, but that does
not seem to help.
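For reference, this is roughly how I load and partition the data. This is a sketch, not my literal code: the path and partition count are placeholders, and the cache() call is something I am considering so that each SGD iteration does not re-read the 10 GB file from disk.

```scala
import org.apache.spark.mllib.util.MLUtils

// Load the ~10 GB libsvm file, repartition it, and cache it so that
// each treeAggregate pass does not re-read and re-parse the file
val data = MLUtils.loadLibSVMFile(sc, "path/to/criteo_train.libsvm")
  .repartition(4)
  .cache()
```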



Andrew Zhang

Class of 2020
