spark-issues mailing list archives

From "Hyukjin Kwon (JIRA)" <>
Subject [jira] [Commented] (SPARK-28562) PySpark profiling is not understandable
Date Thu, 01 Aug 2019 07:24:00 GMT


Hyukjin Kwon commented on SPARK-28562:

Please ask questions on the mailing list rather than filing them as issues. See

> PySpark profiling is not understandable
> ---------------------------------------
>                 Key: SPARK-28562
>                 URL:
>             Project: Spark
>          Issue Type: Question
>          Components: Optimizer
>    Affects Versions: 2.4.0
>            Reporter: Albertus Kelvin
>            Priority: Minor
> I was profiling code in PySpark by setting "spark.python.profile" to "true" in the
> config. I also wrote a simple function consisting of several DataFrame operations,
> such as "withColumn" and "join". Here's the code sample:
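> {code:python}
> from pyspark.sql import functions as F
>
> def join_df(df, df1):
>     # Add a constant column, then derive a second column from it.
>     df = df.withColumn('rowa', F.lit(100))
>     df = df.withColumn('rowb', df['rowa'] * F.lit(100))
>     # Left-join df1 onto df using the 'rowid' column.
>     joined_df = df.join(df1, 'rowid', how='left')
>     return joined_df
> {code}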
> However, after the driver exited, the profiler output was hard to interpret because
> neither my filename nor my methods appeared in it. All it listed were Spark's built-in
> files and methods, such as "", "", and "".
> The question is: how can I show which of my methods are the bottlenecks? For example,
> using the code sample above, I'd like to know the time taken by "withColumn" and "join".
> Thanks.
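
A note on the question above: as the Spark configuration documentation describes it, "spark.python.profile" enables profiling in the Python workers, so it covers code such as RDD functions and Python UDFs. "withColumn" and "join" only build a lazy query plan that is executed in the JVM, which is likely why nothing from the reporter's own file shows up. One rough alternative is to measure wall-clock time around an action on the driver. The sketch below is stdlib-only and purely illustrative; the helper name `timed` and the stand-in workload are hypothetical and not part of PySpark.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock seconds spent in the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

# Stand-in workload (assumption): with a live SparkSession this block would
# instead run an action, e.g. join_df(df, df1).count(), to force execution.
with timed("stand-in work", timings):
    total = sum(i * i for i in range(100_000))

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

Because transformations are lazy, timing "withColumn" or "join" on their own would mostly measure plan construction; the measurement only becomes meaningful once an action such as "count" or "collect" is inside the timed block.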

This message was sent by Atlassian JIRA
