spark-dev mailing list archives

From Hyukjin Kwon <gurwls...@gmail.com>
Subject Re: [DISCUSS] Support pandas API layer on PySpark
Date Sun, 14 Mar 2021 11:03:11 GMT
Firstly, my biggest reason is that I would like to promote this as
built-in support, simply because it matters to a large user group and
the demand is increasing, as the charts indicate. I usually think that
features or add-ons should stay third-party when they serve a smaller
set of users or address a corner case. I think this is similar to the
data sources we have added: Spark ported CSV and Avro because more and
more people used them, and it became important to have them built in.

Secondly, Koalas needs more help from Spark, PySpark, Python and pandas
experts in the wider community. The Koalas team is not expert in all of
these areas, and there are many missing corner cases to fix. Some
require deep expertise in specific areas.

One example is type hints. Koalas uses type hints for schema inference.
Because Python's type hinting had no standard way to express this, Koalas
added its own (hacky) way
<https://koalas.readthedocs.io/en/latest/user_guide/typehints.html#type-hints-in-koalas>
.
Fortunately, the approach Koalas implemented has now been partially
proposed for Python officially (PEP 646).
But Koalas could have done better by interacting more with the Python
community and actively joining the design discussions, to reach an
outcome that benefits both projects and others.
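
To illustrate the general idea of inferring a schema from type hints,
here is a minimal sketch in plain Python. It is not Koalas' actual
implementation: `Frame`, `infer_schema` and `add_ratio` are hypothetical
names made up for this example, and the slice-style subscript only mimics
the general shape of such hints.

```python
import inspect


class Frame:
    """Hypothetical stand-in for a DataFrame type (not a Koalas class)."""

    columns = []

    def __class_getitem__(cls, params):
        # Frame["id": int, "ratio": float] arrives here as a tuple of slice
        # objects; a single column arrives as one bare slice.
        if not isinstance(params, tuple):
            params = (params,)
        columns = [(p.start, p.stop) for p in params]
        # Return a subclass that carries the declared (name, type) pairs.
        return type("TypedFrame", (cls,), {"columns": columns})


def infer_schema(func):
    """Read the declared return columns without running func."""
    return list(inspect.signature(func).return_annotation.columns)


def add_ratio(frame) -> Frame["id": int, "ratio": float]:
    # The body is irrelevant here; only the annotation is inspected.
    return frame


print(infer_schema(add_ratio))
```

The point of the sketch is that the output schema is read from the
annotation alone, so the function does not have to be run on sample data
first to discover its output columns.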

Thirdly, I would like to contribute to the growth of PySpark. Koalas is
growing very fast according to both internal and external stats: the
number of users has roughly doubled every 4 to 6 months.
I think Koalas will provide good momentum to keep Spark growing.

Fourthly, PySpark is still not Pythonic enough. For example, I hear
complaints such as "why does PySpark follow pascalCase?" or "PySpark
APIs are difficult to learn", and APIs are very difficult to change
in Spark (as I emphasized above). This set of Koalas APIs will be able
to address these concerns in PySpark.

Lastly, I really think PySpark needs native plotting features. As I
explained in detail before, I do think this is an important feature
that PySpark lacks and that users need.
I do think Koalas completes what PySpark is currently missing.



On Sun, Mar 14, 2021 at 7:12 PM, Sean Owen <srowen@gmail.com> wrote:

> I like koalas a lot. Playing devil's advocate, why not just let it
> continue to live as an add on? Usually the argument is it'll be maintained
> better in Spark but it's well maintained. It adds some overhead to
> maintaining Spark conversely. On the upside it makes it a little more
> discoverable. Are there more 'synergies'?
>
> On Sat, Mar 13, 2021, 7:57 PM Hyukjin Kwon <gurwls223@gmail.com> wrote:
>
>> Hi all,
>>
>> I would like to start the discussion on supporting pandas API layer on
>> Spark.
>>
>>
>>
>> If we have a general consensus on having it in PySpark, I will initiate
>> and drive an SPIP with a detailed explanation about the implementation’s
>> overview and structure.
>>
>> I would appreciate it if you could let me know whether you support this
>> or not before I start the SPIP.
>>
>> What do you want to propose?
>>
>> I have been working on the Koalas <https://github.com/databricks/koalas>
>> project that is essentially: pandas API support on Spark, and I would like
>> to propose embracing Koalas in PySpark.
>>
>>
>>
>> More specifically, I am thinking about adding a separate package to
>> PySpark for pandas APIs on PySpark. Therefore it wouldn’t break anything
>> in the existing code. The overview would look as below:
>>
>> pyspark_dataframe.[... PySpark APIs ...]
>> pandas_dataframe.[... pandas APIs (local) ...]
>>
>> # The package names will change in the final proposal and during review.
>> koalas_dataframe = koalas.from_pandas(pandas_dataframe)
>> koalas_dataframe = koalas.from_spark(pyspark_dataframe)
>> koalas_dataframe.[... pandas APIs on Spark ...]
>>
>> pyspark_dataframe = koalas_dataframe.to_spark()
>> pandas_dataframe = koalas_dataframe.to_pandas()
>>
>> Koalas provides a pandas API layer on PySpark. It supports almost the
>> same API usage as pandas. Users can leverage their existing Spark cluster
>> to scale their pandas workloads. It works interchangeably with PySpark by
>> making both pandas and PySpark APIs available to users.
>>
>> The project has grown separately for more than two years, and it has been
>> going successfully. With version 1.7.0, Koalas has greatly improved in
>> maturity and stability. Its usability has been proven by adoption among
>> numerous users and by reaching more than 75% API coverage of pandas’
>> Index, Series and DataFrame.
>>
>> I strongly think this is the direction we should go for Apache Spark, and
>> it is a win-win strategy for the growth of both Apache Spark and pandas.
>> Please see the reasons below.
>>
>> Why do we need it?
>>
>>    - Python has grown dramatically in the last few years and has become
>>      one of the most popular languages; see also the Stack Overflow trend
>>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>      for the Python, Java, R and Scala languages.
>>
>>    - pandas has become almost the standard library of data science. Please
>>      also see the Stack Overflow trend
>>      <https://insights.stackoverflow.com/trends?tags=python%2Cjava%2Cscala%2Cr>
>>      for pandas, Apache Spark and PySpark.
>>
>>    - PySpark is not Pythonic enough. At least I myself hear a lot of
>>      complaints. That initiated Project Zen
>>      <https://issues.apache.org/jira/browse/SPARK-32082>, and we have
>>      greatly improved PySpark usability and made it more Pythonic.
>>
>> Nevertheless, data scientists tend to prefer pandas libraries according
>> to the trends, and APIs are hard to change in PySpark. We would have to
>> redesign all the APIs and improve them from scratch, which is very
>> difficult.
>>
>> One straightforward and fast approach is to follow a successful example,
>> namely pandas, which does not itself support distributed execution. Once
>> PySpark supports pandas-like APIs, it can be a good option for pandas
>> users to scale their workloads easily. I do believe this is a win-win
>> strategy for the growth of both pandas and PySpark.
>>
>> In fact, there are already similar efforts such as Dask <https://dask.org/>
>> and Modin <https://modin.readthedocs.io/en/latest/> (other than Koalas
>> <https://github.com/databricks/koalas>). They are all growing fast and
>> successfully, and I find that people compare them to PySpark from time to
>> time; for example, see Beyond Pandas: Spark, Dask, Vaex and other big
>> data technologies battling head to head
>> <https://towardsdatascience.com/beyond-pandas-spark-dask-vaex-and-other-big-data-technologies-battling-head-to-head-a453a1f8cc13>
>> .
>>
>>
>>
>>    - There are many important features missing that are very common in
>>      data science. One of the most important is plotting and drawing
>>      charts. Almost every data scientist plots and draws charts in their
>>      daily work to understand their data quickly and visually, but this is
>>      missing in PySpark. Please see one example in pandas:
>>
>>
>>
>>
>> I do recommend taking a quick look at the blog posts and talks about
>> pandas on Spark:
>> https://koalas.readthedocs.io/en/latest/getting_started/videos_blogs.html.
>> They explain why we need this far better than I can.
>>
>>
