systemml-dev mailing list archives

From Matthias Boehm <mboe...@googlemail.com>
Subject Re: Spark Core
Date Thu, 13 Jul 2017 19:22:09 GMT
Thanks for sharing the setup. Since you're running through MLContext, you
implicitly use our default execution mode, which is hybrid_spark - so you
can ignore my earlier comment.

This execution type simply means that you allow the compiler to generate
hybrid execution plans composed of single-node in-memory operations and
distributed operations on spark. The decisions are primarily based on your
memory budgets and the size of inputs and intermediates of your particular
script. By using '-explain' you will see that each instruction is prefixed
with its execution type (CP or SPARK), which shows whether it runs in the
driver (CP stands for control program) or as a distributed operation.
When using spark-submit, you can set this execution mode via '-mode <type>'.
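For reference, a spark-submit invocation with an explicit execution mode might look like the sketch below. The jar and script names are placeholders, and flag spellings can differ across SystemML releases, so check the documentation for your version:

```sh
# Hypothetical jar/script names; adjust to your installation.
spark-submit \
  --master local[*] \
  SystemML.jar \
  -f myscript.dml \
  -mode hybrid_spark \
  -explain
```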

Regards,
Matthias

On Thu, Jul 13, 2017 at 8:23 AM, arijit chakraborty <akc14@hotmail.com>
wrote:

> Thanks a lot, Matthias, for your guidance!
>
>
> But I have an issue with "hybrid_spark mode". I looked into the SystemML
> documents and searched online, but I didn't come across this setting.
> Could you please help me with how to set up hybrid_spark mode?
>
>
> At present (for our convenience), we are setting up the Spark environment as
> follows. But feel free to suggest a Spark config file setup instead.
>
>
> import os
> import sys
> import pandas as pd
> import numpy as np
>
> spark_path = "C:\spark"
> os.environ['SPARK_HOME'] = spark_path
> os.environ['HADOOP_HOME'] = spark_path
>
> sys.path.append(spark_path + "/bin")
> sys.path.append(spark_path + "/python")
> sys.path.append(spark_path + "/python/pyspark/")
> sys.path.append(spark_path + "/python/lib")
> sys.path.append(spark_path + "/python/lib/pyspark.zip")
> sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
>
> from pyspark import SparkContext
> from pyspark import SparkConf
>
> sc = SparkContext("local[*]", "test")
>
>
> # SystemML Specifications:
>
>
> from pyspark.sql import SQLContext
> import systemml as sml
> sqlCtx = SQLContext(sc)
> ml = sml.MLContext(sc)
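A side note on the Windows paths above: in a plain string literal, backslash sequences such as "\t" are interpreted as escapes, so a raw string is safer. A small stdlib-only sketch of building those sys.path entries (the Spark root is a hypothetical placeholder):

```python
import ntpath  # Windows path rules, regardless of the host OS

# Hypothetical install root; the raw string keeps backslashes literal.
spark_path = r"C:\spark"

# The same entries as in the snippet above, joined consistently.
subdirs = ["bin", "python", r"python\pyspark", r"python\lib",
           r"python\lib\pyspark.zip", r"python\lib\py4j-0.10.4-src.zip"]
entries = [ntpath.join(spark_path, d) for d in subdirs]
for e in entries:
    print(e)
```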
>
>
>
> Thank you!
>
> Arijit
>
> ________________________________
> From: Matthias Boehm <mboehm7@googlemail.com>
> Sent: Thursday, July 13, 2017 1:37:29 AM
> To: dev@systemml.apache.org
> Subject: Re: Spark Core
>
> Well, we explicitly cleanup all intermediates that are no longer used. You
> can use -explain to output the runtime plan, which includes rmvar (remove
> variable), cpvar (copy variable) and mvvar (move variable) instructions
> that internally clean up intermediates. This cleanup removes data from
> memory, potentially evicted/exported variables, and created broadcasts and
> RDDs. However, we also keep lineage to guard against eager broadcast/RDD
> cleanup while they are still used by other lazily evaluated RDDs; whenever
> an RDD is no longer referenced, we clean up its inputs.
>
> Regarding the comparison to R, please ensure you are running in
> hybrid_spark and not forced spark execution mode. Otherwise the latency of
> distributed jobs might dominate the execution time for operations over
> small data. Also, note that the Spark write to CSV currently requires a
> sort (and hence a shuffle) to create the correct row order in the output
> files. If you want to read this back into SystemML later, you are better
> off writing to text or binary format.
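That write-format point can be sketched in DML; the file names below are placeholders:

```
b = a + 10;
# csv requires a sort/shuffle to get the row order right across files
write(b, "out.csv", format="csv");
# text or binary avoids that sort and reads back into SystemML efficiently
write(b, "out.bin", format="binary");
```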
>
> Regards,
> Matthias
>
> On Wed, Jul 12, 2017 at 11:44 AM, arijit chakraborty <akc14@hotmail.com>
> wrote:
>
> > Hi,
> >
> >
> > Suppose I've this following code:
> >
> >
> > a = matrix(seq(1,10), 10,1)
> >
> >
> > for(i in 1:100){
> >
> >   b = a + 10
> >
> >   write (b, "path" + ".csv", format="csv")
> >
> > }
> >
> >
> > So what I'm doing is, for 100 items, I'm adding a constant to a matrix and
> > then outputting it. And this operation occurs in Spark using multiple cores
> > of the system.
> >
> >
> > My question is: after the operation, does the value (here b) remain in that
> > core's memory, so that it piles up over the iterations? Will this affect the
> > performance of the process? If so, how do I clean the memory after each
> > iteration of the loop?
> >
> >
> > The reason for asking is that when I test the code in R, the performance is
> > much better than in SystemML. Since the R to SystemML mapping is almost
> > one-to-one, I'm not sure where I'm making a mistake. And unfortunately, at
> > this stage of progress, I can't share the exact code.
> >
> >
> > Thank you!
> >
> > Arijit
> >
> >
> >
>
