spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject [DISCUSS] Support pandas API layer on PySpark
Date Sun, 14 Mar 2021 01:57:04 GMT
Hi all,

I would like to start a discussion on supporting a pandas API layer on
Spark.



If we have a general consensus on having it in PySpark, I will initiate and
drive an SPIP with a detailed explanation about the implementation’s
overview and structure.

I would appreciate it if I could know whether you guys support this or not
before I start the SPIP.

What do you want to propose?

I have been working on the Koalas <https://github.com/databricks/koalas>
project, which is essentially pandas API support on Spark, and I would like
to propose embracing Koalas in PySpark.



More specifically, I am thinking about adding a separate package to
PySpark for pandas APIs on PySpark. Therefore, it wouldn’t break anything in
the existing code. The overview would look as below:

pyspark_dataframe.[... PySpark APIs ...]
pandas_dataframe.[... pandas APIs (local) ...]

# The package names will change in the final proposal and during review.
koalas_dataframe = koalas.from_pandas(pandas_dataframe)
koalas_dataframe = koalas.from_spark(pyspark_dataframe)
koalas_dataframe.[... pandas APIs on Spark ...]

pyspark_dataframe = koalas_dataframe.to_spark()
pandas_dataframe = koalas_dataframe.to_pandas()

Koalas provides a pandas API layer on PySpark. It supports almost the same
API usage as pandas. Users can leverage their existing Spark cluster to scale
their pandas workloads. It works interchangeably with PySpark by giving users
access to both the pandas and PySpark APIs.

The project has grown separately for more than two years, and it has been
going well. With version 1.7.0, Koalas has greatly improved in maturity
and stability. Its usability has been proven by adoption among numerous users
and by reaching more than 75% API coverage of pandas’ Index, Series and
DataFrame.

I strongly think this is the direction we should go for Apache Spark, and
it is a win-win strategy for the growth of both Apache Spark and pandas.
Please see the reasons below.

Why do we need it?

   - Python has grown dramatically in the last few years and has become one
     of the most popular languages; see also the Stack Overflow trend
     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
     for the Python, Java, R and Scala languages.

   - pandas has become almost the standard library of data science. Please
     also see the Stack Overflow trend
     <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
     for pandas, Apache Spark and PySpark.

   - PySpark is not Pythonic enough. At least, I myself hear a lot of
     complaints. That initiated Project Zen
     <https://issues.apache.org/jira/browse/SPARK-32082>, and we have greatly
     improved PySpark usability and made it more Pythonic.

Nevertheless, the trends suggest that data scientists tend to prefer pandas,
and APIs are hard to change in PySpark. We would have to redesign all the
APIs and improve them from scratch, which is very difficult.

One straightforward and fast approach is to follow a successful example,
namely pandas, which does not itself support distributed execution. Once
PySpark supports pandas-like APIs, it can be a good option for pandas users
to scale their workloads easily. I do believe this is a win-win strategy for
the growth of both pandas and PySpark.

In fact, there are already similar efforts such as Dask <https://dask.org/>
and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
<https://github.com/databricks/koalas>). They are all growing fast and
successfully, and I find that people compare them to PySpark from time to
time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big data
technologies battling head to head
<https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>.



   - There are many important features missing that are very common in data
     science. One of the most important is plotting and drawing charts.
     Almost every data scientist plots and draws charts in their daily work
     to understand their data quickly and visually, but this is missing in
     PySpark. Please see one example in pandas:
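
A minimal sketch of such a pandas plotting call follows. It is not the
example originally attached to this message: the toy data, column names and
output file name are hypothetical, and it assumes matplotlib is installed,
since pandas uses it as the default plotting backend.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend so the script also runs without a display
import pandas as pd

# Hypothetical toy data, for illustration only.
df = pd.DataFrame({"month": [1, 2, 3, 4], "sales": [10, 15, 13, 20]})

# A single call draws the chart; pandas delegates the drawing to matplotlib.
ax = df.plot(x="month", y="sales", kind="line", title="Monthly sales")

# Save the figure to a temporary directory (the file name is arbitrary).
out_path = os.path.join(tempfile.gettempdir(), "monthly_sales.png")
ax.figure.savefig(out_path)
```

The point is the one-line `df.plot(...)` call: a PySpark DataFrame has no
equivalent method.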




I do recommend taking a quick look at the blog posts and talks about pandas
on Spark:
https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
They explain why we need this far better than I can.
