spark-user mailing list archives

From Charles Li <>
Subject Questions about disk IOs
Date Tue, 01 Jul 2014 07:55:46 GMT
Hi Spark,

I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G,
and the weight vector has around 1,300,000 dimensions. I am puzzled that performance is
nearly identical whether or not the data is cached.

The program is simple:
points = sc.hadoopFile(int, SequenceFileInputFormat.class, ...)
points.persist(StorageLevel.MEMORY_AND_DISK_SER()) // comment this out if not caching
gradient = new LogisticGradient();
updater = new SquaredL2Updater();
initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
result = LBFGS.runLBFGS(points.rdd(), gradient, updater, numCorrections, convergeTol,
maxIter, regParam, initWeight);

I have 13 machines with 16 CPUs and 48G RAM each. Spark is running in its standalone
cluster mode. Below are the arguments I am using:
--executor-memory 10G
--num-executors 50
--executor-cores 2

Storage usage when caching:
Cached Partitions 951
Fraction Cached 100%
Size in Memory 215.7GB
Size in Tachyon 0.0B
Size on Disk 1029.7MB
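
For what it's worth, those numbers look roughly consistent with the configured memory. A back-of-envelope check (assuming Spark's default spark.storage.memoryFraction of 0.6; I have not verified the actual setting):

```python
# Rough cache-budget check for the cluster above.
# storage_fraction = 0.6 is an ASSUMED default (spark.storage.memoryFraction),
# not a value stated anywhere in this post.
num_executors = 50
executor_mem_gb = 10
storage_fraction = 0.6

cache_budget_gb = num_executors * executor_mem_gb * storage_fraction
data_in_memory_gb = 215.7           # "Size in Memory" from the storage tab
spilled_to_disk_gb = 1029.7 / 1024  # "Size on Disk", converted from MB

print(f"cache budget: {cache_budget_gb:.1f} GB")  # 300.0 GB across executors
print(f"cached data:  {data_in_memory_gb:.1f} GB")

# The dataset fits under the budget, matching "Fraction Cached 100%",
# with only ~1 GB spilling to disk.
assert data_in_memory_gb < cache_budget_gb
```

So the cache itself does not appear to be the bottleneck, which makes the unchanged aggregate times all the more confusing.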

Every aggregate takes around 5 minutes with caching enabled, and lots of disk I/O can be
seen on the Hadoop nodes. I get the same result with caching disabled.

Shouldn't caching the data points improve performance? Shouldn't caching reduce the disk I/O?

Thanks in advance.