spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jeff Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16321) Pyspark 2.0 performance drop vs pyspark 1.6
Date Fri, 01 Jul 2016 17:31:11 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15359339#comment-15359339
] 

Jeff Zhang commented on SPARK-16321:
------------------------------------

This could due to a lot things, may be reading parquet file,  the python lambda or some other
spark sql changes from 1.6 to 2.0. First could you verify whether it is due to reading parquet
file by calling the following code

{code}
df = sqlctx.read.parquet(path)
df.where('id > some_id').count()
{code}

> Pyspark 2.0 performance drop vs pyspark 1.6
> -------------------------------------------
>
>                 Key: SPARK-16321
>                 URL: https://issues.apache.org/jira/browse/SPARK-16321
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.0.0
>            Reporter: Maciej BryƄski
>
> I did some test on parquet file with many nested columns (about 30G in
> 400 partitions) and Spark 2.0 is 2x slower.
> {code}
> df = sqlctx.read.parquet(path)
> df.where('id > some_id').rdd.flatMap(lambda r: [r.id] if not r.id %100000 else []).collect()
> {code}
> Spark 1.6 -> 2.3 min
> Spark 2.0 -> 4.6 min (2x slower)
> I used BasicProfiler for this task and cumulative time was:
> Spark 1.6 - 4300 sec
> Spark 2.0 - 5800 sec
> Should I expect such a drop in performance ?
> I don't know how to prepare sample data to show the problem.
> Any ideas ? Or public data with many nested columns ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message