spark-user mailing list archives

From olivierjeunen <>
Subject Spark's Logistic Regression runs unstable on Yarn cluster
Date Fri, 12 Aug 2016 10:08:55 GMT
I'm using pyspark ML's logistic regression implementation to do some
classification on an AWS EMR Yarn cluster.

The cluster consists of 10 m3.xlarge nodes and is configured as follows:
spark.driver.memory 10g, spark.driver.cores 3, spark.executor.memory 10g,
spark.executor.cores 4.

I enabled yarn's dynamic allocation abilities.
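For reference, dynamic allocation on Yarn is governed by a handful of settings along these lines (a sketch; the values shown are illustrative assumptions, not my actual configuration):

```
# spark-defaults.conf (sketch; values are illustrative assumptions)
spark.shuffle.service.enabled               true   # external shuffle service, required for dynamic allocation on Yarn
spark.dynamicAllocation.enabled             true
spark.dynamicAllocation.minExecutors        2      # assumed lower bound
spark.dynamicAllocation.maxExecutors        13     # assumed upper bound; capping this bounds how many executors Yarn can grant
spark.dynamicAllocation.executorIdleTimeout 60s    # idle executors are released after this interval
```

Without an explicit maxExecutors, Spark is free to request as many executors as Yarn will grant, so the executor count can vary from run to run.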

The problem is that my results are highly unstable. Sometimes my application
finishes using 13 executors in total; sometimes all of them seem to die and the
application ends up using anywhere between 100 and 200...

Any insight into what could cause this stochastic behaviour would be greatly
appreciated.

The code used to run the logistic regression:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

data = ...  # input DataFrame (elided)
lr = LogisticRegression()
evaluator = BinaryClassificationEvaluator()
lrModel = lr.fit(data.filter(data.test == 0))
predictions = lrModel.transform(data.filter(data.test == 1))
auROC = evaluator.evaluate(predictions)
print "auROC on test set: ", auROC

The data is a DataFrame of roughly 2.8 GB.
