spark-user mailing list archives

From Weichen Xu <weichen...@databricks.com>
Subject Re: Applying a Java script to many files: Java API or also Python API?
Date Fri, 29 Sep 2017 09:34:57 GMT
Although Python can launch a subprocess to run Java code, in PySpark the
processing code that needs to run in parallel on the cluster has to be
written in Python. For example, in PySpark:

def f(x):
    ...
rdd.map(f)  # The function `f` must be pure Python code

If you try to launch a subprocess to run Java code inside the function `f`,
it will add significant overhead and bring many other issues.
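For illustration, a minimal sketch of what that anti-pattern would look like
(the jar path and main class below are hypothetical; the per-record
subprocess call is exactly the overhead to avoid):

import subprocess

def f(x):
    # Hypothetical example: spawns a new JVM for every record,
    # paying JVM startup and stdin/stdout serialization each time.
    result = subprocess.run(
        ["java", "-cp", "myjob.jar", "com.example.Process"],
        input=str(x).encode(),
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode()

rdd.map(f)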

On Thu, Sep 28, 2017 at 5:36 PM, Giuseppe Celano <celano@informatik.uni-leipzig.de> wrote:

> Hi,
>
> What I meant is that I could run the Java script using the subprocess
> module in Python. In that case, is any performance difference expected
> (compared with coding directly in the Java API)? Thanks.
>
>
>
> On Sep 28, 2017, at 3:32 AM, Weichen Xu <weichen.xu@databricks.com> wrote:
>
> I think you have to use the Spark Java API; in PySpark, functions running on
> Spark executors (such as map functions) can only be written in Python.
>
> On Thu, Sep 28, 2017 at 12:48 AM, Giuseppe Celano <celano@informatik.uni-leipzig.de> wrote:
>
>> Hi everyone,
>>
>> I would like to apply a Java script to many files in parallel. I am
>> wondering whether I should definitely use the Spark Java API, or whether I
>> could also run the script using the Python API (with which I am more
>> familiar), without this affecting performance. Thanks.
>>
>> Giuseppe
>>
>>
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>>
>
>
