Hi,
I'm developing a model in SystemML and running it through Spark. I'm running the code in
the following way:
# Spark Specifications:
import os
import sys
import pandas as pd
import numpy as np
spark_path = r"C:\spark"  # raw string, so the backslash in the Windows path isn't treated as an escape
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext("local[*]", "test")
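As an aside, I've seen the findspark package suggested as a simpler alternative to the manual
sys.path bookkeeping above -- just a sketch, assuming findspark is installed (pip install findspark):
# findspark locates SPARK_HOME and does the sys.path setup before pyspark is imported:
import findspark
findspark.init(spark_path)
from pyspark import SparkContext
sc = SparkContext("local[*]", "test")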
# SystemML Specifications:
from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc)
# Importing the data
train_data = pd.read_csv("data1.csv")
test_data = pd.read_csv("data2.csv")
train_data = sqlCtx.createDataFrame(train_data)
test_data = sqlCtx.createDataFrame(test_data)
# Finally executing the code:
scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
script = sml.dml(scriptUrl).input(bdframe_train=train_data,
                                  bdframe_test=test_data).output("check_func")
beta = ml.execute(script).get("check_func").toNumPy()
pd.DataFrame(beta).head(1)
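To figure out where the time is going, I'm planning to turn on the statistics and explain
flags on MLContext -- if I'm reading the API docs correctly, something like:
# My understanding of the MLContext API: print runtime statistics and the
# execution plan, to see which part of the run takes the 20-30 seconds.
ml.setStatistics(True)
ml.setExplain(True)
beta = ml.execute(script).get("check_func").toNumPy()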
The data sizes are 1000 and 100 rows for train and test respectively. I'm testing on a small
dataset during development and will try a larger dataset later. I'm running on my local machine
with 4 cores.
The problem is that if I run the model in R, it takes a fraction of a second, but when I run
it like this, it takes around 20-30 seconds.
Could anyone please suggest how to improve the execution speed, or any other way to execute
the code that would run faster?
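For what it's worth, I also plan to time repeated executions on the same context, to separate
the one-time JVM/Spark startup and compilation cost from the per-run cost -- a rough sketch:
import time
# Re-run the same script on the warm context: if only the first run is
# slow, the overhead is mostly startup/compilation, not the model itself.
for i in range(3):
    start = time.time()
    ml.execute(script).get("check_func")
    print("run %d took %.2f seconds" % (i + 1, time.time() - start))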
Also, thank you all for releasing the 0.14 version. There are a few improvements we
found extremely helpful.
Thank you!
Arijit