spark-dev mailing list archives

From westurner <wes.tur...@gmail.com>
Subject Re: PySpark on PyPi
Date Tue, 11 Aug 2015 21:31:53 GMT
westurner wrote
> 
> Matt Goodman wrote
>> I would tentatively suggest also conda packaging.
>> 
>> http://conda.pydata.org/docs/
> $ conda skeleton pypi pyspark
> # update git_tag and git_url
> # add test commands (import pyspark; import pyspark.[...])
> 
> Docs for building conda packages for multiple operating systems and
> interpreters from PyPI packages:
> 
> * http://www.pydanny.com/building-conda-packages-for-multiple-operating-systems.html
> * https://github.com/audreyr/cookiecutter/issues/232

* conda meta.yaml can specify test imports, commands, and/or a test script
  (e.g. run_test.sh / run_test.py) that should return 0

  Docs: http://conda.pydata.org/docs/building/meta-yaml.html#test-section
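
  A minimal sketch of such a test script (conda-build also picks up a
  run_test.py placed next to meta.yaml); the module list is an assumption
  about what the recipe would package, and it presumes py4j is declared as a
  run dependency:

      # run_test.py: illustrative smoke test for a hypothetical pyspark recipe
      import pyspark          # importing pyspark already requires py4j
      import pyspark.rdd      # core RDD API
      import pyspark.sql      # DataFrame API

      print("pyspark imports OK from", pyspark.__file__)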


Wes Turner wrote
> 
> Matt Goodman wrote
>> --Matthew Goodman
>> 
>> =====================
>> Check Out My Website: http://craneium.net
>> Find me on LinkedIn: http://tinyurl.com/d6wlch
>> 
>> On Mon, Aug 10, 2015 at 11:23 AM, Davies Liu <davies@...> wrote:
>> 
>>> I think so; any contributions on this are welcome.
>>>
>>> On Mon, Aug 10, 2015 at 11:03 AM, Brian Granger <ellisonbg@...> wrote:
>>> > Sorry, trying to follow the context here. Does it look like there is
>>> > support for the idea of creating a setup.py file and pypi package for
>>> > pyspark?
>>> >
>>> > Cheers,
>>> >
>>> > Brian
>>> >
>>> > On Thu, Aug 6, 2015 at 3:14 PM, Davies Liu <davies@...> wrote:
>>> >> We could do that after 1.5 is released; it will have the same release
>>> >> cycle as Spark in the future.
>>> >>
>>> >> On Tue, Jul 28, 2015 at 5:52 AM, Olivier Girardot <o.girardot@...> wrote:
>>> >>> +1 (once again :) )
>>> >>>
>>> >>> 2015-07-28 14:51 GMT+02:00 Justin Uang <justin.uang@...>:
>>> >>>>
>>> >>>> // ping
>>> >>>>
>>> >>>> do we have any signoff from the pyspark devs to submit a PR to
>>> >>>> publish to PyPI?
>>> >>>>
>>> >>>> On Fri, Jul 24, 2015 at 10:50 PM Jeremy Freeman <freeman.jeremy@...>
>>> >>>> wrote:
>>> >>>>>
>>> >>>>> Hey all, great discussion, just wanted to +1 that I see a lot of
>>> >>>>> value in steps that make it easier to use PySpark as an ordinary
>>> >>>>> python library.
>>> >>>>>
>>> >>>>> You might want to check out this project
>>> >>>>> (https://github.com/minrk/findspark), started by Jupyter project
>>> >>>>> devs, that offers one way to facilitate this stuff. I’ve also cc'ed
>>> >>>>> them here to join the conversation.
>>> >>>>>
>>> >>>>> Also, @Jey, I can also confirm that at least in some scenarios
>>> >>>>> (I’ve done it in an EC2 cluster in standalone mode) it’s possible
>>> >>>>> to run PySpark jobs just using `from pyspark import SparkContext;
>>> >>>>> sc = SparkContext(master=“X”)`, so long as the environment
>>> >>>>> variables (PYTHONPATH and PYSPARK_PYTHON) are set correctly on
>>> >>>>> *both* workers and driver. That said, there’s definitely additional
>>> >>>>> configuration / functionality that would require going through the
>>> >>>>> proper submit scripts.
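
(A minimal sketch of the env-var-only launch described above; the master URL
is a placeholder, and PYTHONPATH / PYSPARK_PYTHON are assumed to already be
set consistently on the driver and the workers:)

    # Assumes PYTHONPATH already contains $SPARK_HOME/python and the py4j
    # zip, and PYSPARK_PYTHON points at the same interpreter everywhere.
    from pyspark import SparkContext

    sc = SparkContext(master="spark://master:7077")  # placeholder master URL
    print(sc.parallelize([1, 2, 3]).count())         # trivial smoke test
    sc.stop()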
>>> >>>>>
>>> >>>>> On Jul 22, 2015, at 7:41 PM, Punyashloka Biswal <punya.biswal@...> wrote:
>>> >>>>>
>>> >>>>> I agree with everything Justin just said. An additional advantage
>>> >>>>> of publishing PySpark's Python code in a standards-compliant way
>>> >>>>> is the fact that we'll be able to declare transitive dependencies
>>> >>>>> (Pandas, Py4J) in a way that pip can use. Contrast this with the
>>> >>>>> current situation, where df.toPandas() exists in the Spark API but
>>> >>>>> doesn't actually work until you install Pandas.
>>> >>>>>
>>> >>>>> Punya
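
(A sketch of what declaring those dependencies could look like in a pyspark
setup.py; the version pins and the pandas extra are illustrative assumptions,
not proposed values:)

    # setup.py (sketch): names and versions below are placeholders
    from setuptools import setup, find_packages

    setup(
        name="pyspark",
        version="1.5.0",            # would track the Spark release exactly
        packages=find_packages(),
        install_requires=[
            "py4j==0.8.2.1",        # the JVM bridge pyspark imports at startup
        ],
        extras_require={
            # so df.toPandas() works after `pip install pyspark[pandas]`
            "pandas": ["pandas"],
        },
    )
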
>>> >>>>> On Wed, Jul 22, 2015 at 12:49 PM Justin Uang <justin.uang@...> wrote:
>>> >>>>>>
>>> >>>>>> // + Davies for his comments
>>> >>>>>> // + Punya for SA
>>> >>>>>>
>>> >>>>>> For development and CI, like Olivier mentioned, I think it would
>>> >>>>>> be hugely beneficial to publish pyspark (only code in the python/
>>> >>>>>> dir) on PyPI. If anyone wants to develop against PySpark APIs,
>>> >>>>>> they need to download the distribution and do a lot of PYTHONPATH
>>> >>>>>> munging for all the tools (pylint, pytest, IDE code completion).
>>> >>>>>> Right now that involves adding python/ and
>>> >>>>>> python/lib/py4j-0.8.2.1-src.zip. In case pyspark ever wants to add
>>> >>>>>> more dependencies, we would have to manually mirror all the
>>> >>>>>> PYTHONPATH munging in the ./pyspark script. With a proper pyspark
>>> >>>>>> setup.py which declares its dependencies, and a published
>>> >>>>>> distribution, depending on pyspark will just be adding pyspark to
>>> >>>>>> my setup.py dependencies.
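
(For reference, the munging described above typically ends up in something
like a conftest.py or a CI environment script; this is a sketch of the
status quo rather than a recommendation, and it assumes SPARK_HOME is set
and the py4j version named in this thread:)

    # conftest.py (sketch): make pyspark importable for pytest/pylint/IDEs
    import os
    import sys

    SPARK_HOME = os.environ["SPARK_HOME"]
    sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
    sys.path.insert(0, os.path.join(SPARK_HOME, "python", "lib",
                                    "py4j-0.8.2.1-src.zip"))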
>>> >>>>>>
>>> >>>>>> Of course, if we actually want to run parts of pyspark that are
>>> >>>>>> backed by Py4J calls, then we need the full spark distribution
>>> >>>>>> with either ./pyspark or ./spark-submit, but for things like
>>> >>>>>> linting and development, the PYTHONPATH munging is very annoying.
>>> >>>>>>
>>> >>>>>> I don't think the version-mismatch issues are a compelling reason
>>> >>>>>> to not go ahead with PyPI publishing. At runtime, we should
>>> >>>>>> definitely enforce that the version has to be exact, which means
>>> >>>>>> there is no backcompat nightmare as suggested by Davies in
>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267.
>>> >>>>>> This would mean that even if the user got his pip-installed
>>> >>>>>> pyspark to somehow get loaded before the pyspark provided by the
>>> >>>>>> spark distribution, the user would be alerted immediately.
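
(A sketch of the kind of exact-version guard being proposed; the function
name and message are made up for illustration, not taken from an actual
patch:)

    # Hypothetical runtime check: fail fast if the pip-installed Python half
    # does not exactly match the JVM half it is talking to.
    def ensure_exact_version(sc, python_side_version):
        jvm_side_version = sc.version  # version reported by the Java half
        if jvm_side_version != python_side_version:
            raise RuntimeError(
                "PySpark %s cannot be used with Spark %s; versions must "
                "match exactly" % (python_side_version, jvm_side_version))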
>>> >>>>>>
>>> >>>>>> Davies, if you buy this, should I or someone on my team pick up
>>> >>>>>> https://issues.apache.org/jira/browse/SPARK-1267 and
>>> >>>>>> https://github.com/apache/spark/pull/464?
>>> >>>>>>
>>> >>>>>> On Sat, Jun 6, 2015 at 12:48 AM Olivier Girardot <o.girardot@...> wrote:
>>> >>>>>>>
>>> >>>>>>> Ok, I get it. Now what can we do to improve the current
>>> >>>>>>> situation, because right now if I want to set up a CI env for
>>> >>>>>>> PySpark, I have to:
>>> >>>>>>> 1- download a pre-built version of pyspark and unzip it
>>> >>>>>>> somewhere on every agent
>>> >>>>>>> 2- define the SPARK_HOME env var
>>> >>>>>>> 3- symlink this distribution's pyspark dir inside the python
>>> >>>>>>> install dir's site-packages/ directory
>>> >>>>>>> and if I rely on additional packages (like databricks' Spark-CSV
>>> >>>>>>> project), I have to (except if I'm mistaken)
>>> >>>>>>> 4- compile/assemble spark-csv, deploy the jar in a specific
>>> >>>>>>> directory on every agent
>>> >>>>>>> 5- add this jar-filled directory to the Spark distribution's
>>> >>>>>>> additional classpath using the conf/spark-defaults.conf file
>>> >>>>>>>
>>> >>>>>>> Then finally we can launch our unit/integration tests.
>>> >>>>>>> Some issues are related to spark-packages, some to the lack of
>>> >>>>>>> python-based dependency handling, and some to the way
>>> >>>>>>> SparkContexts are launched when using pyspark.
>>> >>>>>>> I think steps 1 and 2 are fair enough.
>>> >>>>>>> 4 and 5 may already have solutions, I didn't check, and
>>> >>>>>>> considering spark-shell downloads such dependencies automatically,
>>> >>>>>>> I think if nothing's done yet, it will be (I guess?).
>>> >>>>>>>
>>> >>>>>>> For step 3, maybe just adding a setup.py to the distribution
>>> >>>>>>> would be enough; I'm not exactly advocating distributing a full
>>> >>>>>>> 300MB spark distribution on PyPI, maybe there's a better
>>> >>>>>>> compromise?
>>> >>>>>>>
>>> >>>>>>> Regards,
>>> >>>>>>>
>>> >>>>>>> Olivier.
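
(Once steps 1-5 above are in place, the tests themselves can stay plain
pytest; a sketch of a session-scoped fixture for CI agents, assuming pyspark
is importable there; the app name and master URL are arbitrary:)

    # test fixture (sketch), e.g. in a conftest.py on the CI agent
    import pytest
    from pyspark import SparkContext

    @pytest.fixture(scope="session")
    def sc():
        context = SparkContext(master="local[4]", appName="ci-tests")
        yield context
        context.stop()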
>>> >>>>>>>
>>> >>>>>>> On Fri, Jun 5, 2015 at 10:12 PM, Jey Kottalam <jey@.berkeley> wrote:
>>> >>>>>>>>
>>> >>>>>>>> Couldn't we have a pip installable "pyspark" package that just
>>> >>>>>>>> serves as a shim to an existing Spark installation? Or it could
>>> >>>>>>>> even download the latest Spark binary if SPARK_HOME isn't set
>>> >>>>>>>> during installation. Right now, Spark doesn't play very well
>>> >>>>>>>> with the usual Python ecosystem. For example, why do I need to
>>> >>>>>>>> use a strange incantation when booting up IPython if I want to
>>> >>>>>>>> use PySpark in a notebook with MASTER="local[4]"? It would be
>>> >>>>>>>> much nicer to just type `from pyspark import SparkContext;
>>> >>>>>>>> sc = SparkContext("local[4]")` in my notebook.
>>> >>>>>>>>
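
(A rough sketch of what such a shim could do at import time, essentially
what the findspark project mentioned above does; the function name is made
up and the "download a Spark binary" branch is left out:)

    # Shim sketch: locate an existing Spark install and put its Python
    # sources on sys.path so `from pyspark import SparkContext` just works.
    import glob
    import os
    import sys

    def init_pyspark(spark_home=None):
        spark_home = spark_home or os.environ.get("SPARK_HOME")
        if not spark_home:
            raise RuntimeError("SPARK_HOME is not set (downloading a Spark "
                               "binary is out of scope for this sketch)")
        sys.path.insert(0, os.path.join(spark_home, "python"))
        # pick up whatever py4j zip ships with that Spark version
        sys.path[:0] = glob.glob(
            os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))
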
>>> >>>>>>>> I did a test and it seems like PySpark's basic unit-tests do
>>> >>>>>>>> pass when SPARK_HOME is set and Py4J is on the PYTHONPATH:
>>> >>>>>>>>
>>> >>>>>>>> PYTHONPATH=$SPARK_HOME/python/:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH python $SPARK_HOME/python/pyspark/rdd.py
>>> >>>>>>>>
>>> >>>>>>>> -Jey
>>> >>>>>>>>
>>> >>>>>>>>
>>> >>>>>>>> On Fri, Jun 5, 2015 at 10:57 AM, Josh Rosen <rosenville@...> wrote:
>>> >>>>>>>>>
>>> >>>>>>>>> This has been proposed before:
>>> >>>>>>>>> https://issues.apache.org/jira/browse/SPARK-1267
>>> >>>>>>>>>
>>> >>>>>>>>> There's currently tighter coupling between the Python and Java
>>> >>>>>>>>> halves of PySpark than just requiring SPARK_HOME to be set; if
>>> >>>>>>>>> we did this, I bet we'd run into tons of issues when users try
>>> >>>>>>>>> to run a newer version of the Python half of PySpark against an
>>> >>>>>>>>> older set of Java components or vice-versa.
>>> >>>>>>>>>
>>> >>>>>>>>> On Thu, Jun 4, 2015 at 10:45 PM, Olivier Girardot <o.girardot@...> wrote:
>>> >>>>>>>>>>
>>> >>>>>>>>>> Hi everyone,
>>> >>>>>>>>>> Considering the python API as just a front end needing
>>> >>>>>>>>>> SPARK_HOME defined anyway, I think it would be interesting to
>>> >>>>>>>>>> deploy the Python part of Spark on PyPI in order to handle the
>>> >>>>>>>>>> dependencies in a Python project needing PySpark via pip.
>>> >>>>>>>>>>
>>> >>>>>>>>>> For now I just symlink python/pyspark into my python install
>>> >>>>>>>>>> dir's site-packages/ in order for PyCharm or other lint tools
>>> >>>>>>>>>> to work properly.
>>> >>>>>>>>>> I can do the setup.py work or anything.
>>> >>>>>>>>>>
>>> >>>>>>>>>> What do you think ?
>>> >>>>>>>>>>
>>> >>>>>>>>>> Regards,
>>> >>>>>>>>>>
>>> >>>>>>>>>> Olivier.
>>> >>>>>>>>>
>>> >>>>>>>>>
>>> >>>>>>>>
>>> >>>>>
>>> >>>
>>> >
>>> >
>>> >
>>> > --
>>> > Brian E. Granger
>>> > Cal Poly State University, San Luis Obispo
>>> > @ellisonbg on Twitter and GitHub
>>> > bgranger@... and ellisonbg@...
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: dev-help@spark.apache.org





--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/PySpark-on-PyPi-tp12626p13637.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org

