spark-dev mailing list archives

From "Ulanov, Alexander" <>
Subject Profiling Spark: MemoryStore
Date Fri, 13 Mar 2015 00:34:40 GMT

I am working on artificial neural networks for Spark. The model is trained with gradient descent: on each step the data is read, a sum of gradients is computed for each data partition (on each worker), aggregated (on the driver) and broadcast back. I noticed that the gradient computation takes several times less time than the whole step. To narrow down my observation, I ran the gradient on a single machine with a single partition of data of size 100MB that I persist (data.persist). This should minimize at least the aggregation overhead, but the gradient computation still takes much less time than the whole step. Just in case: the data is loaded by MLUtils.loadLibSVMFile into an RDD[LabeledPoint]. This is my code:

    val conf = new SparkConf().setAppName("myApp").setMaster("local[2]")
    val train = MLUtils.loadLibSVMFile(new SparkContext(conf), "/data/mnist/mnist.scale").repartition(1).persist()
    // arguments: training data, batch size, hidden layer sizes, iterations, LBFGS tolerance
    val model = ANN2Classifier.train(train, 1000, Array[Int](32), 10, 1e-4)
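For clarity, the per-step pattern I describe above (partition-local gradient sums on the workers, aggregation on the driver) can be sketched without Spark in plain Scala. The squared-error gradient of a linear model here is only a stand-in for illustration, not the actual ANN gradient:

```scala
object GradientStepSketch {
  // Illustrative squared-error gradient for a linear model: grad = (w.x - y) * x
  def gradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val pred = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    x.map(xi => (pred - y) * xi)
  }

  def addVec(a: Array[Double], b: Array[Double]): Array[Double] =
    a.zip(b).map { case (u, v) => u + v }

  def main(args: Array[String]): Unit = {
    val w = Array(0.0, 0.0)
    // Two "partitions" of (features, label) pairs, standing in for RDD partitions.
    val partitions = Seq(
      Seq((Array(1.0, 0.0), 1.0), (Array(0.0, 1.0), 2.0)),
      Seq((Array(1.0, 1.0), 3.0))
    )
    val n = partitions.map(_.size).sum
    // Per-partition gradient sums (worker side) ...
    val partialSums = partitions.map { part =>
      part.map { case (x, y) => gradient(w, x, y) }.reduce(addVec)
    }
    // ... then one aggregation (driver side) and a descent step.
    val total = partialSums.reduce(addVec)
    val lr = 0.1
    val wNew = w.zip(total).map { case (wi, gi) => wi - lr * gi / n }
    println(wNew.mkString(","))
  }
}
```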

The profiler shows two threads: one is computing the Gradient, and I don't know what the other is doing. Gradient accounts for only about 10% of its thread's time; almost all of the remaining time is spent in MemoryStore. Below is the screenshot (first thread):
[profiler screenshot: first thread]
Second thread:
[profiler screenshot: second thread]

Could Spark developers please elaborate on what's going on in MemoryStore? It seems to do some string operations (parsing the libsvm file? Why on every step?) and a lot of InputStream reading. The overall time seems to depend on the size of the data batch (or the size of the vectors) I am processing, but the dependence does not look linear to me.

Also, I would like to know how to speed up these operations.
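One guess (an assumption on my part, not a confirmed diagnosis): if the persisted RDD was never materialized, or does not fully fit in memory, each iteration would re-read and re-parse the input file, which would match the string parsing and InputStream reading I see in MemoryStore. The effect is analogous to a Scala lazy view, which re-runs its transformation on every traversal, versus a strictly materialized collection:

```scala
object CacheSketch {
  def main(args: Array[String]): Unit = {
    var parses = 0
    val raw = Seq("1.0", "2.0", "3.0")

    // Lazy view: parsing re-runs on every traversal,
    // like an RDD that is never actually cached.
    val lazyParsed = raw.view.map { s => parses += 1; s.toDouble }
    lazyParsed.sum
    lazyParsed.sum
    assert(parses == 6) // 3 elements parsed twice

    // Materialized once, like a fully cached RDD.
    parses = 0
    val cached = raw.map { s => parses += 1; s.toDouble }
    cached.sum
    cached.sum
    assert(parses == 3) // parsing happened only on construction
    println(parses)
  }
}
```

If this is the cause, forcing materialization once before training (e.g. calling train.count() after persist()) should make the first iteration pay the parsing cost and leave later iterations reading from the cache.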

Best regards, Alexander
