spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Xiangrui Meng <men...@gmail.com>
Subject Re: Questions about disk IOs
Date Tue, 01 Jul 2014 16:08:17 GMT
Try to reduce number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.

PR: https://github.com/apache/spark/pull/1110

-Xiangrui

On Tue, Jul 1, 2014 at 12:55 AM, Charles Li <littlee1032@gmail.com> wrote:
> Hi Spark,
>
> I am running LBFGS on our user data. The data size with Kryo serialisation is about 210G.
The weight size is around 1,300,000. I am quite confused that the performance is very close
whether the data is cached or not.
>
> The program is simple:
> points = sc.hadoopFIle(int, SequenceFileInputFormat.class …..)
> points.persist(StorageLevel.Memory_AND_DISK_SER()) // comment it if not cached
> gradient = new LogisticGrandient();
> updater = new SquaredL2Updater();
> initWeight = Vectors.sparse(size, new int[]{}, new double[]{})
> result = LBFGS.runLBFGS(points.rdd(), grandaunt, updater, numCorrections, convergeTol,
maxIter, regParam, initWeight);
>
> I have 13 machines with 16 cpus, 48G RAM each. Spark is running on its cluster mode.
Below are some arguments I am using:
> —executor-memory 10G
> —num-executors 50
> —executor-cores 2
>
> Storage Using:
> When caching:
> Cached Partitions 951
> Fraction Cached 100%
> Size in Memory 215.7GB
> Size in Tachyon 0.0B
> Size on Disk 1029.7MB
>
> The time cost by every aggregate is around 5 minutes with cache enabled. Lots of disk
IOs can be seen on the hadoop node. I have the same result with cache disabled.
>
> Should data points caching improve the performance? Should caching decrease the disk
IO?
>
> Thanks in advance.

Mime
View raw message