spark-dev mailing list archives

From Davies Liu <dav...@databricks.com>
Subject Re: PySpark on PyPi
Date Thu, 06 Aug 2015 22:14:51 GMT
We could do that after 1.5 is released; it will have the same release cycle
as Spark in the future.

On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot
<o.girardot@lateral-thoughts.com> wrote:
> +1 (once again :) )
>
> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@gmail.com>:
>>
>> // ping
>>
>> do we have any signoff from the pyspark devs to submit a PR to publish to
>> PyPI?
>>
>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jeremy@gmail.com>
>> wrote:
>>>
>>> Hey all, great discussion, just wanted to +1 that I see a lot of value in
>>> steps that make it easier to use PySpark as an ordinary python library.
>>>
>>> You might want to check out this (https://github.com/minrk/findspark),
>>> started by Jupyter project devs, that offers one way to facilitate this
>>> stuff. I’ve also cced them here to join the conversation.
>>>
>>> Also, @Jey, I can also confirm that at least in some scenarios (I’ve done
>>> it in an EC2 cluster in standalone mode) it’s possible to run PySpark jobs
>>> just using `from pyspark import SparkContext; sc = SparkContext(master="X")`
>>> so long as the environment variables (PYTHONPATH and PYSPARK_PYTHON) are
>>> set correctly on *both* workers and driver. That said, there’s definitely
>>> additional configuration / functionality that would require going through
>>> the proper submit scripts.
>>>
>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@gmail.com>
>>> wrote:
>>>
>>> I agree with everything Justin just said. An additional advantage of
>>> publishing PySpark's Python code in a standards-compliant way is the fact
>>> that we'll be able to declare transitive dependencies (Pandas, Py4J) in a
>>> way that pip can use. Contrast this with the current situation, where
>>> df.toPandas() exists in the Spark API but doesn't actually work until you
>>> install Pandas.
>>>
>>> Punya
>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.uang@gmail.com>
>>> wrote:
>>>>
>>>> // + Davies for his comments
>>>> // + Punya for SA
>>>>
>>>> For development and CI, like Olivier mentioned, I think it would be
>>>> hugely beneficial to publish pyspark (only code in the python/ dir) on PyPI.
>>>> If anyone wants to develop against PySpark APIs, they need to download the
>>>> distribution and do a lot of PYTHONPATH munging for all the tools (pylint,
>>>> pytest, IDE code completion). Right now that involves adding python/ and
>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add more
>>>> dependencies, we would have to manually mirror all the PYTHONPATH munging in
>>>> the ./pyspark script. With a proper pyspark setup.py which declares its
>>>> dependencies, and a published distribution, depending on pyspark will just
>>>> be adding pyspark to my setup.py dependencies.
>>>>
>>>> Of course, if we actually want to run parts of pyspark that are backed by
>>>> Py4J calls, then we need the full spark distribution with either ./pyspark
>>>> or ./spark-submit, but for things like linting and development, the
>>>> PYTHONPATH munging is very annoying.
>>>>
>>>> I don't think the version-mismatch issues are a compelling reason to not
>>>> go ahead with PyPI publishing. At runtime, we should definitely enforce that
>>>> the version has to be exact, which means there is no backcompat nightmare as
>>>> suggested by Davies in https://issues.apache.org/jira/browse/SPARK-1267.
>>>> This would mean that even if the user got his pip installed pyspark to
>>>> somehow get loaded before the spark distribution provided pyspark, then the
>>>> user would be alerted immediately.
>>>>
>>>> Davies, if you buy this, should I or someone on my team pick up
>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>>> https://github.com/apache/spark/pull/464?
>>>>
>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot
>>>> <o.girardot@lateral-thoughts.com> wrote:
>>>>>
>>>>> Ok, I get it. Now what can we do to improve the current situation,
>>>>> because right now if I want to set up a CI env for PySpark, I have to:
>>>>> 1- download a pre-built version of pyspark and unzip it somewhere on
>>>>> every agent
>>>>> 2- define the SPARK_HOME env
>>>>> 3- symlink this distribution pyspark dir inside the python install dir
>>>>> site-packages/ directory
>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>>>> project), I have to (except if I'm mistaken)
>>>>> 4- compile/assemble spark-csv, deploy the jar in a specific directory
>>>>> on every agent
>>>>> 5- add this jar-filled directory to the Spark distribution's additional
>>>>> classpath using the conf/spark-defaults.conf file
>>>>>
>>>>> Then finally we can launch our unit/integration-tests.
>>>>> Some issues are related to spark-packages, some to the lack of
>>>>> python-based dependency, and some to the way SparkContexts are launched when
>>>>> using pyspark.
>>>>> I think steps 1 and 2 are fair enough.
>>>>> Steps 4 and 5 may already have solutions; I didn't check, and considering
>>>>> spark-shell is downloading such dependencies automatically, I think if
>>>>> nothing's done yet it will be (I guess?).
>>>>>
>>>>> For step 3, maybe just adding a setup.py to the distribution would be
>>>>> enough. I'm not exactly advocating distributing a full 300 MB Spark
>>>>> distribution on PyPI; maybe there's a better compromise?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <jey@cs.berkeley.edu>
>>>>> wrote:
>>>>>>
>>>>>> Couldn't we have a pip installable "pyspark" package that just serves
>>>>>> as a shim to an existing Spark installation? Or it could even download the
>>>>>> latest Spark binary if SPARK_HOME isn't set during installation. Right now,
>>>>>> Spark doesn't play very well with the usual Python ecosystem. For example,
>>>>>> why do I need to use a strange incantation when booting up IPython if I want
>>>>>> to use PySpark in a notebook with MASTER="local[4]"? It would be much nicer
>>>>>> to just type `from pyspark import SparkContext; sc =
>>>>>> SparkContext("local[4]")` in my notebook.
>>>>>>
>>>>>> I did a test and it seems like PySpark's basic unit-tests do pass when
>>>>>> SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>>>>>
>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH \
>>>>>> python $SPARK_HOME/python/pyspark/rdd.py
>>>>>>
>>>>>> -Jey
>>>>>>
>>>>>>
>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>> This has been proposed before:
>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>>>>>>
>>>>>>> There's currently tighter coupling between the Python and Java halves
>>>>>>> of PySpark than just requiring SPARK_HOME to be set; if we did this, I bet
>>>>>>> we'd run into tons of issues when users try to run a newer version of the
>>>>>>> Python half of PySpark against an older set of Java components or
>>>>>>> vice-versa.
>>>>>>>
>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot
>>>>>>> <o.girardot@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>> Considering the python API as just a front needing the SPARK_HOME
>>>>>>>> defined anyway, I think it would be interesting to deploy the Python part of
>>>>>>>> Spark on PyPi in order to handle the dependencies in a Python project
>>>>>>>> needing PySpark via pip.
>>>>>>>>
>>>>>>>> For now I just symlink the python/pyspark in my python install dir
>>>>>>>> site-packages/ in order for PyCharm or other lint tools to work properly.
>>>>>>>> I can do the setup.py work or anything.
>>>>>>>>
>>>>>>>> What do you think ?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Olivier.
>>>>>>>
>>>>>>>
>>>>>>
>>>
>
