spark-user mailing list archives

From Andy Davidson <>
Subject understanding rdd pipe() and bin/spark-submit --master
Date Sat, 20 Sep 2014 19:51:26 GMT

I am new to Spark and started writing some simple test code to figure out
how things work. I am very interested in Spark Streaming and Python. It
appears that streaming is not supported in Python yet. The workaround I
found by googling is to write your streaming code in either Scala or Java
and use RDD pipe() to fetch the data into your Python app. I do not think I
am getting parallel execution. In my current experiment I am using a MacBook
Pro with 8 cores. I wrote a Java job that processes a small data file on
disk, uses a couple of transformations, and writes the data to standard out.
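As far as I understand, pipe() launches the external command once per partition and streams that partition's elements, one per line, to the command's stdin. Here is a minimal sketch of that behavior in plain Python (not PySpark; pipe_partition and the partition layout are made up for illustration):

```python
# Sketch: simulate how RDD.pipe() runs the external command once per
# partition, feeding that partition's elements to the command's stdin.
import subprocess

def pipe_partition(elements, command):
    # Each partition gets its own subprocess, like RDD.pipe().
    proc = subprocess.run(
        command,
        input="\n".join(str(e) for e in elements) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

partitions = [[1, 2], [3], [4, 5]]  # an RDD split into 3 partitions
result = [line
          for part in partitions
          for line in pipe_partition(part, ["cat"])]
print(result)  # each element passes through exactly once: ['1', '2', '3', '4', '5']
```

With a command like cat that echoes its stdin, every element comes back exactly once, no matter how many partitions there are.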

I have a simple Python script that calls pipe():

import sys

from operator import add

from pyspark import SparkContext

sc = SparkContext(appName="pySparkPipeJava")

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)

# pipe() returns strings
output = distData.pipe("../bin/").collect()
for num in output:
    print "pySpark: value from java job %s" % num

I submit the Python job as follows:

$ $SPARK_HOME/bin/spark-submit --master local[1]

Everything works as expected as long as I only use a single core. If I use 4
cores I get back 4 copies of the data. My understanding is that the shell
script will execute on all the workers. In general I want all the transforms
and actions to run in parallel; however, I only want to process a single set
of data in my Python script.
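My guess at what is happening: if the piped program ignores its stdin and always emits its own full output (as my Java job does), then each of the N per-partition subprocesses emits that output, and collect() sees N copies. A plain-Python sketch of that behavior (the ignore_stdin command below is a stand-in for the Java job, not the real thing):

```python
# Sketch: a piped command that discards its stdin and always prints its
# own output behaves like my Java job - with N partitions, collect()
# gets back N copies of that output.
import subprocess

def pipe_partition(elements, command):
    proc = subprocess.run(
        command,
        input="\n".join(str(e) for e in elements) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

# Stand-in for the Java job: drain stdin, then emit a fixed result.
ignore_stdin = ["sh", "-c", "cat > /dev/null; echo result"]

partitions = [[1], [2], [3], [4]]  # local[4]-style: four partitions
output = [line
          for part in partitions
          for line in pipe_partition(part, ignore_stdin)]
print(output)  # ['result', 'result', 'result', 'result'] - four copies
```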

Here is the code for

$SPARK_HOME/bin/spark-submit \
  --class "myJavaSrc" \
  --master local[4] \

What is really strange is that if I replace the piped command with a simple
shell script that basically just echoes its input, I can run with 4 cores
and get back only one set of data. Any idea what the difference is? (My
Java child does not read anything from the Python app; the echo script does.)

I tried changing the number of cores, but that does not seem to change how
much data I get back.
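If my guess above is right, the number of copies is driven by the number of partitions, not cores, so collapsing the RDD to one partition before calling pipe() (coalesce(1) or repartition(1) in Spark) should give back a single copy. Simulated here in plain Python (pipe_partition and ignore_stdin are illustrative stand-ins):

```python
# Sketch: merging all partitions into one before piping - the plain-Python
# analogue of rdd.coalesce(1).pipe(...) - yields a single copy of the
# output from a command that ignores its stdin.
import subprocess

def pipe_partition(elements, command):
    proc = subprocess.run(
        command,
        input="\n".join(str(e) for e in elements) + "\n",
        capture_output=True,
        text=True,
    )
    return proc.stdout.splitlines()

ignore_stdin = ["sh", "-c", "cat > /dev/null; echo result"]

partitions = [[1], [2], [3], [4]]
one_partition = [[e for part in partitions for e in part]]  # like coalesce(1)
output = [line
          for part in one_partition
          for line in pipe_partition(part, ignore_stdin)]
print(output)  # ['result'] - a single copy
```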

Seems like being limited to a single core would be severely limiting.

Here is the code for my "echo" script:



#!/bin/sh
# Use this shell script to figure out how spark RDD pipe() works

#set -x # turns shell debugging on
#set +x # turns shell debugging off

while read x ; do
    echo $x
done


Thanks in advance

