systemml-dev mailing list archives

From arijit chakraborty <>
Subject Re: Spark Core
Date Thu, 13 Jul 2017 15:23:10 GMT
Thanks a lot, Matthias, for your guidance!

But I'm having an issue with "hybrid spark mode". I looked through the SystemML documentation
and tried googling, but I didn't come across this setting. Could you please help me with how
to set up hybrid spark mode?

At present (for our convenience), we are setting up our Spark requirements as follows, but
feel free to suggest a Spark config-file setup instead.

import os
import sys
import pandas as pd
import numpy as np

# use a raw string so the backslash is not treated as an escape character
spark_path = r"C:\spark"
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path

sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local[*]", "test")

# SystemML Specifications:

from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc)

Thank you!


From: Matthias Boehm <>
Sent: Thursday, July 13, 2017 1:37:29 AM
Subject: Re: Spark Core

Well, we explicitly clean up all intermediates that are no longer used. You
can use -explain to output the runtime plan, which includes rmvar (remove
variable), cpvar (copy variable), and mvvar (move variable) instructions
that internally clean up intermediates. This cleanup removes data from
memory, potentially evicted/exported variables, and created broadcasts and
RDDs. However, we also keep lineage to guard against eager broadcast/RDD
cleanup if they are still used by other lazily evaluated RDDs, but whenever
an RDD is no longer referenced, we clean up its inputs.
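
For reference, a minimal command-line invocation that prints the runtime plan might look like
the following (a sketch; the script name is a placeholder, and SystemML.jar refers to the
released SystemML artifact):

```shell
# run a DML script in Spark batch mode and print the runtime plan,
# including the rmvar/cpvar/mvvar cleanup instructions
spark-submit SystemML.jar -f script.dml -explain runtime
```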

Regarding the comparison to R, please ensure you are running in
hybrid_spark and not in forced spark execution mode. Otherwise, the latency of
distributed jobs might dominate the execution time for operations over
small data. Also, note that the Spark write to csv currently requires a
sort (and hence a shuffle) to create the correct order of rows in the output
files. If you want to read the data back into SystemML later, you would be
better off writing to text or binary format.
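
As a concrete example, the execution mode can be selected with the -exec flag when invoking
SystemML from the command line (a sketch; the script name is a placeholder):

```shell
# select the hybrid_spark execution mode, which runs operations over small
# data in the driver and only distributes operations over large data
spark-submit SystemML.jar -f script.dml -exec hybrid_spark
```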


On Wed, Jul 12, 2017 at 11:44 AM, arijit chakraborty <> wrote:

> Hi,
> Suppose I have the following code:
> a = matrix(seq(1,10), 10,1)
> for(i in 1:100){
>   b = a + 10
>   write (b, "path" + ".csv", format="csv")
> }
> So what I'm doing is, for 100 iterations, adding a constant to a matrix and
> then outputting it. And this operation runs in Spark using multiple cores of
> the system.
> My question is: after the operation, does the value (here b) remain in that
> core's memory, so that it piles up over iterations? Will this affect the
> performance of the process? If so, how do I clean the
> memory after each iteration of the loop?
> The reason for asking is that when I test the code in R, the
> performance is much better than in SystemML. Since R to SystemML is almost a
> one-to-one mapping, I'm not sure where I'm making a mistake. And
> unfortunately, at this stage of progress, I can't share the exact code.
> Thank you!
> Arijit
