spark-user mailing list archives

From Andy Zhang <andrewmzh...@berkeley.edu>
Subject Spark TreeAggregate Slow LogisticRegressionWithSGD
Date Fri, 09 Feb 2018 22:05:39 GMT
Hello,


I am running logistic regression with SGD (using the LogisticRegressionWithSGD
class) on a large libsvm file (the Kaggle Criteo dataset [1]). The file is
about 10 GB and contains 40 million training examples. My code is set up to
run minibatches of roughly 10 examples; a simplified sketch of the setup is
below.
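Roughly, the training call looks like this (a simplified sketch, not my exact
test.scala; the path and numIterations value are illustrative, and sc is the
spark-shell SparkContext):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "/path/to/criteo.libsvm").cache()
    val n = data.count()  // ~40 million examples

    // miniBatchFraction is a fraction of the whole dataset, so a ~10-example
    // minibatch over 40M examples means a fraction of about 2.5e-7
    val model = LogisticRegressionWithSGD.train(
      data,
      100,        // numIterations (illustrative)
      0.01,       // stepSize
      10.0 / n)   // miniBatchFraction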

When I run the Scala code, the event timeline in the Spark UI shows most of
the time being spent in the treeAggregate calls inside GradientDescent. In
fact, each treeAggregate takes about 8 minutes to compute the loss over a
minibatch of only ~10 examples. Why does this take so long, and what can be
done to make it faster? I have attached a screenshot of the event timeline;
here is a link to the image:
https://github.com/andrewmzhang/spark-lr/blob/master/Screenshot%20from%202018-02-08%2021-56-20.png
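My understanding (paraphrased, not the actual MLlib source) is that
GradientDescent does a sample-then-treeAggregate over the full RDD on every
iteration, along these lines:

    import org.apache.spark.rdd.RDD

    // Self-contained illustration of the per-iteration pattern; the function
    // name and the RDD[Double] element type are my own, not MLlib's
    def miniBatchSum(data: RDD[Double], fraction: Double, iter: Int): (Double, Long) =
      data.sample(withReplacement = false, fraction, seed = 42 + iter)
        .treeAggregate((0.0, 0L))(
          seqOp  = { case ((sum, n), x) => (sum + x, n + 1) },          // within a partition
          combOp = { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }) // across partitions

If that is right, the catch is that sample() is itself a full scan: even with
a fraction small enough to yield ~10 examples, every iteration still visits
every partition of the RDD, so an uncached 10 GB dataset would be re-read
from disk on each treeAggregate. Is that what I am seeing?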

If I instead run Spark with 1 master and 1 slave, 8 cores each, treeAggregate
drops to about 80 seconds per call. However, with a stepSize of 0.01, the
logistic loss never converges; it plateaus at around 0.67. Since the weights
stop improving past a logloss of 0.67, Spark ends up stopping the training
loop early. Why does training not converge? Is there anything wrong with my
settings (sketched below)?
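For reference, these are the knobs I have been adjusting (a sketch with
illustrative values, using the class-based API rather than the static train()
helper, and reusing data and n from the sketch above):

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    val lr = new LogisticRegressionWithSGD()
    lr.optimizer
      .setStepSize(0.01)
      .setNumIterations(100)         // illustrative
      .setMiniBatchFraction(10.0 / n)
      .setConvergenceTol(0.001)      // the default; training stops early once
                                     // the change in the weight vector falls
                                     // below this
    val model = lr.run(data)

I wonder whether the early stop I am seeing is the convergenceTol check
firing on tiny weight updates rather than genuine convergence of the loss.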


Here is the actual code I am using:
https://github.com/andrewmzhang/spark-lr/blob/master/test.scala. The repo
also contains the preprocessing code, the commands to run the script, and the
log from a previous run. The dataset is on local disk, and I am running Spark
locally on a single node with 1 core. I have tried both 1 and 4 partitions,
but that does not seem to help.

[1] https://s3-eu-west-1.amazonaws.com/criteo-labs/dac.tar.gz



Regards,

Andrew Zhang

-- 
Class of 2020
