systemml-dev mailing list archives

From Rajarshi Bhadra <bhadrarajars...@gmail.com>
Subject Implementation of Parallelized process in Standalone Spark Cluster using SystemML
Date Wed, 26 Jul 2017 11:32:51 GMT
Hi,

I have been using SystemML for some time and am finding it extremely
useful for scaling up my algorithm with Spark. However, there are a few
aspects I do not fully understand and would like some clarification on.

My system configuration: 244 GB RAM, 32 cores.
My Spark configuration:
    spark.executor.cores           4
    spark.driver.memory            80g
    spark.executor.memory          20g
    spark.memory.fraction          0.75
    spark.worker.cleanup.enabled   true
    spark.default.parallelism      1
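For reference, the same settings expressed as a spark-defaults.conf fragment (values copied from the list above; adjust to taste):

```
# Spark standalone configuration used for these runs
spark.executor.cores           4
spark.driver.memory            80g
spark.executor.memory          20g
spark.memory.fraction          0.75
spark.worker.cleanup.enabled   true
spark.default.parallelism      1
```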

I am trying to port a process from R. It is similar to randomForest and
involves growing trees. In R I parallelize it with parLapply, so that n
trees are grown in n parallel processes. I have implemented the algorithm
in an identical way in SystemML and run it with a parfor loop. There are
two main issues I am facing:
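For context, the structure of my DML implementation looks roughly like this (a simplified sketch; grow_tree, model_size, and the input matrices X and y are placeholders for my actual code):

```
# Sketch of the parfor pattern: grow n trees independently,
# one tree per iteration, each stored as a row of 'models'.
n = 30;
models = matrix(0, rows=n, cols=model_size);
parfor(i in 1:n) {
  # grow_tree() is a placeholder for my tree-growing function
  models[i,] = grow_tree(X, y, i);
}
```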

1. In R with ncore = 16 I get 30 trees in 10 minutes, but on Spark via
SystemML the same process takes 1 hour.
2. I have also noticed that if one tree takes 2 minutes to run, 5 trees
take 7-8 minutes. It seems I am unable to parallelize the process across
trees in SystemML.

It would be great if someone could help me out with this.

Thank you
Rajarshi
