spark-dev mailing list archives

From Amit Rana <amitranavs...@gmail.com>
Subject Re: Understanding pyspark data flow on worker nodes
Date Thu, 07 Jul 2016 08:58:22 GMT
As mentioned in the documentation:
PythonRDD objects launch Python subprocesses and communicate with them
using pipes, sending the user's code and the data to be processed.

I am trying to understand how this transfer of data over pipes is implemented.
Can anyone please guide me along those lines?
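
To make my question more concrete, my current mental model of that pipe-based
transfer is roughly the toy sketch below: one standalone Python file that plays
both the "driver" and "worker" roles, with a made-up length-prefixed framing and
an upper-casing worker. None of the names or the framing here are Spark's actual
protocol or API; it is only meant to illustrate the pattern I think PythonRDD
follows, and what I cannot find is where the equivalent of this happens in the
real code.

import struct
import subprocess
import sys

def write_msg(stream, data: bytes):
    # toy framing: 4-byte big-endian length, then the payload
    stream.write(struct.pack(">i", len(data)))
    stream.write(data)
    stream.flush()

def read_msg(stream) -> bytes:
    header = stream.read(4)
    if len(header) < 4:
        raise EOFError("stream closed")
    (length,) = struct.unpack(">i", header)
    return stream.read(length)

def worker_loop():
    # toy "worker": read records from stdin, upper-case them, write them back
    stdin, stdout = sys.stdin.buffer, sys.stdout.buffer
    while True:
        try:
            record = read_msg(stdin)
        except EOFError:
            break
        write_msg(stdout, record.upper())

def driver():
    # toy "driver": launch the worker subprocess and pipe records through it
    proc = subprocess.Popen(
        [sys.executable, __file__, "--worker"],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )
    for record in [b"hello", b"pyspark", b"pipes"]:
        write_msg(proc.stdin, record)
        print(read_msg(proc.stdout).decode())
    proc.stdin.close()
    proc.wait()

if __name__ == "__main__":
    worker_loop() if "--worker" in sys.argv else driver()

In other words, I would like to trace where the real executor side writes into
the Python worker's input and reads the results back.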

Thanks,
Amit Rana
On 7 Jul 2016 13:44, "Sun Rui" <sunrise_win@163.com> wrote:

> You can read
> https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
> For PySpark data flow on worker nodes, you can read the source code of
> PythonRDD.scala. Python worker processes communicate with Spark executors
> via sockets instead of pipes.
>
> On Jul 7, 2016, at 15:49, Amit Rana <amitranavsr94@gmail.com> wrote:
>
> Hi all,
>
> I am trying to trace the data flow in PySpark. I am using IntelliJ IDEA
> on Windows 7.
> I submitted a Python job as follows:
> --master local[4] <path to pyspark job> <arguments to the job>
>
> I have made the following observations after running the above command in
> debug mode:
> -> Locally, when the PySpark interpreter starts, it also starts a JVM with
> which it communicates through a socket.
> -> Py4J is used to handle this communication.
> -> This JVM then acts as the actual Spark driver, and loads a JavaSparkContext
> which communicates with the Spark executors in the cluster.
>
> I have read that, in the cluster, the data flow between Spark executors and
> the Python interpreter happens using pipes, but I am not able to trace that
> data flow.
>
> Please correct me if my understanding is wrong. It would be very helpful
> if someone could help me understand the code flow for data transfer between
> the JVM and the Python workers.
>
> Thanks,
> Amit Rana
>
>
>
