spark-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Adam Roberts <AROBE...@uk.ibm.com>
Subject Re: Understanding pyspark data flow on worker nodes
Date Fri, 08 Jul 2016 10:09:47 GMT
Hi, sharing what I discovered with PySpark too, corroborates with what 
Amit notices and also interested in the pipe question:
h
ttps://mail-archives.apache.org/mod_mbox/spark-dev/201603.mbox/%3C201603291521.u2TFLBfO024212@d06av05.portsmouth.uk.ibm.com%3E


// Start a thread to feed the process input from our parent's iterator 
  val writerThread = new WriterThread(env, worker, inputIterator, 
partitionIndex, context)

...

// Return an iterator that read lines from the process's stdout 
  val stream = new DataInputStream(new 
BufferedInputStream(worker.getInputStream, bufferSize))

The above code and what follows look to be the important parts.



Note that Josh Rosen replied to my comment with more information:

"One clarification: there are Python interpreters running on executors so 
that Python UDFs and RDD API code can be executed. Some slightly-outdated 
but mostly-correct reference material for this can be found at 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals. 

See also: search the Spark codebase for PythonRDD and look at 
python/pyspark/worker.py"




From:   Reynold Xin <rxin@databricks.com>
To:     Amit Rana <amitranavsr94@gmail.com>
Cc:     "dev@spark.apache.org" <dev@spark.apache.org>
Date:   08/07/2016 07:03
Subject:        Re: Understanding pyspark data flow on worker nodes



You can look into its source code: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala


On Thu, Jul 7, 2016 at 11:01 PM, Amit Rana <amitranavsr94@gmail.com> 
wrote:
Hi all,
Did anyone get a chance to look into it??
Any sort of guidance will be much appreciated.
Thanks,
Amit Rana
On 7 Jul 2016 14:28, "Amit Rana" <amitranavsr94@gmail.com> wrote:
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them 
using pipes, sending the user's code and the data to be processed.
I am trying to understand  the implementation of how this data transfer is 
happening  using pipes.
Can anyone please guide me along that line??
Thanks, 
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui" <sunrise_win@163.com> wrote:
You can read 
https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
For pySpark data flow on worker nodes, you can read the source code of 
PythonRDD.scala. Python worker processes communicate with Spark executors 
via sockets instead of pipes.

On Jul 7, 2016, at 15:49, Amit Rana <amitranavsr94@gmail.com> wrote:

Hi all,
I am trying  to trace the data flow in pyspark. I am using intellij IDEA 
in windows 7.
I had submitted  a python  job as follows:
--master local[4] <path to pyspark  job> <arguments to the job>
I have made the following  insights after running the above command in 
debug mode:
->Locally when a pyspark's interpreter starts, it also starts a JVM with 
which it communicates through socket.
->py4j is used to handle this communication 
->Now this JVM acts as actual spark driver, and loads a JavaSparkContext 
which communicates with the spark executors in cluster.
In cluster I have read that the data flow between spark executors and 
python interpreter happens using pipes. But I am not able to trace that 
data flow.
Please correct me if my understanding is wrong. It would be very helpful 
if, someone can help me understand tge code-flow for data transfer between 
JVM and python workers.
Thanks,
Amit Rana



Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

Mime
View raw message