spark-user mailing list archives

From Russell Spitzer <russell.spit...@gmail.com>
Subject Re: Why is Spark 3.0.x faster than Spark 3.1.x
Date Thu, 08 Apr 2021 13:33:04 GMT
Actually that only defaults to true in master ... so that may not be it ...

On Thu, Apr 8, 2021 at 8:28 AM Russell Spitzer <russell.spitzer@gmail.com>
wrote:

> Try disabling adaptive query execution:
> https://spark.apache.org/docs/latest/sql-performance-tuning.html#adaptive-query-execution
> That would explain the different number of tasks post-shuffle.
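>
> To rule it out quickly, a minimal sketch (nothing here beyond the standard
> spark.sql.adaptive.enabled flag) would be to pin AQE off on the session
> builder and rerun the same job under both versions:
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder \
>     .master("local[*]") \
>     .config("spark.sql.adaptive.enabled", "false") \
>     .getOrCreate()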
>
> On Thu, Apr 8, 2021 at 7:58 AM Mich Talebzadeh <mich.talebzadeh@gmail.com>
> wrote:
>
>> OK, you need to assess where the versions have the biggest impact in terms
>> of timings.
>>
>> From the Spark GUI for each run, under the Stages tab and Completed Stages,
>> check how long the Duration was for each task and how it differs for
>> identical tasks across the two Spark versions.
>>
>> Example
>>
>> [image: image.png]
>>
>> In our case the biggest impact is writing to Google BigQuery from
>> on-premises, with variable timing because of the network: we are using
>> Cloud VPN (over the public internet) as opposed to Cloud Interconnect (a
>> dedicated network).
>>
>> So in your case, which stage is the most time-consuming?
>>
>> HTH
>>
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 8 Apr 2021 at 13:13, maziyar <maziyar.panahi@iscpif.fr> wrote:
>>
>>> Hi Mich,
>>>
>>> Thanks for the reply.
>>>
>>> I have tried to minimize as much as possible the effect of other factors
>>> between pyspark==3.0.2 and pyspark==3.1.1, including not reading CSV or gz
>>> files and just reading the Parquet.
>>>
>>> Here is code purely in PySpark (nothing else included); it finishes within
>>> 47 seconds in pyspark 3.1.1 and 15 seconds in pyspark 3.0.2 (the
>>> performance hit is still very large!):
>>>
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import explode, count, col
>>> from pyspark.ml.feature import RegexTokenizer, StopWordsRemover
>>>
>>> spark = SparkSession.builder \
>>>         .master("local[*]") \
>>>         .config("spark.driver.memory", "16G") \
>>>         .config("spark.driver.maxResultSize", "0") \
>>>         .config("spark.serializer",
>>>                 "org.apache.spark.serializer.KryoSerializer") \
>>>         .config("spark.kryoserializer.buffer.max", "2000m") \
>>>         .getOrCreate()
>>>
>>> Toys = spark.read \
>>>     .parquet('./toys-cleaned').repartition(12)
>>>
>>> # tokenize the text
>>> regexTokenizer = RegexTokenizer(inputCol="reviewText",
>>>                                 outputCol="all_words", pattern="\\W")
>>> toys_with_words = regexTokenizer.transform(Toys)
>>>
>>> # remove stop words
>>> remover = StopWordsRemover(inputCol="all_words", outputCol="words")
>>> toys_with_tokens = remover.transform(toys_with_words).drop("all_words")
>>>
>>> all_words = toys_with_tokens.select(explode("words").alias("word"))
>>> # group by, sort and limit to 50k
>>> top50k = all_words.groupBy("word").agg(count("*").alias("total")) \
>>>     .sort(col("total").desc()).limit(50000)
>>>
>>> top50k.show()
>>>
>>> This is a local test: just two different conda environments, one for
>>> pyspark==3.0.2 and one for pyspark==3.1.1, with the same dataset, same
>>> code, and same session settings. I think this is a very easy way to
>>> reproduce the issue without involving any third-party libraries. The two
>>> screenshots actually pinpoint the issue: 3.0.2 runs 12 tasks in parallel,
>>> while 3.1.1 also has 12 tasks but 10 of them finish immediately and the
>>> other 2 keep processing. (Also, CPU usage in 3.0.2 is at full capacity,
>>> while in 3.1.1 it is very low.)
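>>>
>>> To check whether the parallelism itself differs, a quick sanity check in
>>> both environments (plain PySpark calls, nothing version-specific; the
>>> variable names match the snippet above) could be:
>>>
>>> print(toys_with_tokens.rdd.getNumPartitions())        # partitions feeding the explode
>>> print(spark.conf.get("spark.sql.shuffle.partitions")) # post-shuffle partitions for the groupBy
>>> print(spark.conf.get("spark.sql.adaptive.enabled", "not set"))  # AQE on or off
>>> print(spark.sparkContext.defaultParallelism)          # cores seen by local[*]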
>>>
>>> Something is different in spark/pyspark 3.1.1; I am not sure whether it's
>>> the partitioning, the groupBy, the limit, or just a conf being enabled or
>>> disabled in 3.1.1 that results in these performance differences.
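>>>
>>> One way to narrow down a conf difference would be to dump the effective
>>> SQL confs in each environment and diff the two files (just a sketch; the
>>> file names are examples):
>>>
>>> # SET -v lists every SQL conf with its effective value
>>> rows = spark.sql("SET -v").select("key", "value").collect()
>>> with open("confs-3.1.1.txt", "w") as f:  # use confs-3.0.2.txt in the other env
>>>     for row in sorted(rows, key=lambda r: r.key):
>>>         f.write(f"{row.key}={row.value}\n")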
>>>
>>>
>>>
>>> --
>>> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>>
