spark-dev mailing list archives

From Abdeali Kothari <>
Subject Re: PySpark syntax vs Pandas syntax
Date Tue, 26 Mar 2019 06:32:04 GMT
Thanks for the reply, Reynold - has this shim project started?
I'd love to contribute to it - as it looks like I have started making a
bunch of helper functions to do something similar for my current task and
would prefer not doing it in isolation.
Was considering making a git repo and pushing stuff there just this
morning. But if there are already folks working on it - I'd prefer

Note - I'm not recommending we make the logical plan mutable (as I am
scared of that too!). I think there are other ways of handling that - but
we can go into details later.

On Tue, Mar 26, 2019 at 11:58 AM Reynold Xin <> wrote:

> We have been thinking about some of these issues. Some of them are harder
> to do, e.g. Spark DataFrames are fundamentally immutable, and making the
> logical plan mutable is a significant deviation from the current paradigm
> that might confuse the hell out of some users. We are considering building
> a shim layer as a separate project on top of Spark (so we can make rapid
> releases based on feedback) just to test this out and see how well it could
> work in practice.
> On Mon, Mar 25, 2019 at 11:04 PM Abdeali Kothari <>
> wrote:
>> Hi,
>> I was doing some spark to pandas (and vice versa) conversion because some
>> of our pandas code doesn't work on huge data, and some of our spark code
>> runs very slowly on small data.
>> It was nice to see that pyspark had some similar syntax for the common
>> pandas operations that the python community is used to.
>> GroupBy aggs: df.groupby(['col2']).agg({'col2': 'count'}).show()
>> Column selects: df[['col1', 'col2']]
>> Row Filters: df[df['col1'] < 3.0]
>> I was wondering about a bunch of other functions in pandas which seemed
>> common. And thought there must've been a discussion about it in the
>> community - hence started this thread.
>> I was wondering whether there has been discussion on adding the following
>> functions:
>> *Column setters*:
>> In Pandas:
>> df['col3'] = df['col1'] * 3.0
>> While I do the following in PySpark:
>> df = df.withColumn('col3', df['col1'] * 3.0)
>> *Column apply()*:
>> In Pandas:
>> df['col3'] = df['col1'].apply(lambda x: x * 3.0)
>> While I do the following in PySpark:
>> df = df.withColumn('col3', F.udf(lambda x: x * 3.0, 'float')(df['col1']))
>> I understand that this one cannot be as simple as in pandas due to the
>> output-type that's needed here. But could be done like:
>> df['col3'] = df['col1'].apply((lambda x: x * 3.0), 'float')
>> Multi column in pandas is:
>> df['col3'] = df[['col1', 'col2']].apply(lambda x: x.col1 * 3.0, axis=1)
>> Maybe this can be done in pyspark as below, or if we can send a
>> pyspark.sql.Row directly it would be similar (?):
>> df['col3'] = df[['col1', 'col2']].apply((lambda col1, col2: col1 * 3.0),
>> 'float')
>> *Rename*:
>> In Pandas:
>> df.rename(columns={...})
>> While I do the following in PySpark:
>> df.toDF(*[{'col2': 'col3'}.get(i, i) for i in df.columns])
>> *To Dictionary*:
>> In Pandas:
>> df.to_dict(orient='list')
>> While I do the following in PySpark:
>> {f.name: [row[i] for row in df.collect()] for i, f in
>> enumerate(df.schema.fields)}
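
Not part of the original thread: a minimal sketch of the same "dict of lists" conversion done on rows that have already been collected once, so the data is not fetched again for every column. The helper name `rows_to_column_lists` is hypothetical. It assumes `rows` would be the result of `df.collect()` and `names` would be `df.columns`; since a `pyspark.sql.Row` is a tuple subclass, plain tuples stand in for rows here.

```python
from typing import Any, Dict, List, Sequence


def rows_to_column_lists(
    rows: Sequence[Sequence[Any]], names: Sequence[str]
) -> Dict[str, List[Any]]:
    """Transpose row tuples into {column name: list of values}."""
    # Start with one empty list per column so empty input still yields all keys.
    result: Dict[str, List[Any]] = {name: [] for name in names}
    for row in rows:
        # Values in a row are assumed to be in the same order as `names`.
        for name, value in zip(names, row):
            result[name].append(value)
    return result


# Plain tuples stand in for collected rows (assumption, see above).
rows = [(1.0, "a"), (2.0, "b")]
columns = rows_to_column_lists(rows, ["col1", "col2"])
```

With a real DataFrame this would presumably be called as `rows_to_column_lists(df.collect(), df.columns)`, which still pulls all data to the driver, just as `to_dict` does in pandas.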
>> I thought I'd start the discussion with these and come back to some of
>> the others I see that could be helpful.
>> *Note*: (with the column functions in mind) I understand that a
>> DataFrame cannot be modified, and I am not suggesting we change that
>> or any underlying principle. Just trying to add syntactic sugar here.
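
Not part of the original thread: a minimal, pure-Python sketch of how pandas-style setter syntax could sit on top of an immutable frame without mutating it. `ImmutableFrame` and `SugarFrame` are hypothetical names, standing in for a Spark DataFrame and a shim wrapper respectively; nothing here is the Spark API or the design of the shim project mentioned above. The only idea shown is that `df['col3'] = ...` can build a new frame and rebind a reference.

```python
from typing import Dict, List


class ImmutableFrame:
    """Hypothetical stand-in for a Spark DataFrame: operations return new frames."""

    def __init__(self, columns: Dict[str, List[float]]):
        # Copy the input so later changes by the caller cannot leak in.
        self._columns = {name: list(values) for name, values in columns.items()}

    @property
    def columns(self) -> List[str]:
        return list(self._columns)

    def with_column(self, name: str, values: List[float]) -> "ImmutableFrame":
        new_columns = dict(self._columns)
        new_columns[name] = list(values)
        return ImmutableFrame(new_columns)

    def renamed(self, mapping: Dict[str, str]) -> "ImmutableFrame":
        return ImmutableFrame(
            {mapping.get(name, name): values for name, values in self._columns.items()}
        )

    def to_dict(self) -> Dict[str, List[float]]:
        return {name: list(values) for name, values in self._columns.items()}


class SugarFrame:
    """Hypothetical wrapper offering pandas-like syntax over an ImmutableFrame."""

    def __init__(self, frame: ImmutableFrame):
        self._frame = frame

    def __getitem__(self, name: str) -> List[float]:
        return self._frame.to_dict()[name]

    def __setitem__(self, name: str, values: List[float]) -> None:
        # No in-place mutation: build a new frame and rebind the reference.
        self._frame = self._frame.with_column(name, values)

    def rename(self, columns: Dict[str, str]) -> "SugarFrame":
        return SugarFrame(self._frame.renamed(columns))

    def to_dict(self) -> Dict[str, List[float]]:
        return self._frame.to_dict()


base = ImmutableFrame({"col1": [1.0, 2.0], "col2": [3.0, 4.0]})
df = SugarFrame(base)
df["col3"] = [x * 3.0 for x in df["col1"]]  # setter sugar
renamed = df.rename(columns={"col2": "col2_new"})
```

After this runs, `base` still has only `col1` and `col2`; the wrapper simply points at a newer frame, which is the sense in which the sugar leaves the underlying immutability untouched.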
