spark-user mailing list archives

From ZHANG Wei <wezh...@outlook.com>
Subject Re: What is the best way to take the top N entries from a hive table/data source?
Date Wed, 22 Apr 2020 11:45:48 GMT
The performance issue might be caused by the Parquet table's partition count, which is only 3.
The reader uses that partition count to parallelize the extraction.

Refer to the log you provided:
> spark.sql("select * from db.table limit 1000000").explain(false)
> == Physical Plan ==
> CollectLimit 1000000
> +- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet, Location: CatalogFileIndex[...],
> PartitionCount: 3, PartitionFilters: [], PushedFilters: [], ReadSchema:.....
...PartitionCount: 3,...

According to the first email:
> val df = spark.sql("select * from table limit n")
> df.write.parquet(....)

You can try recreating the Parquet table with more partitions. I hope this page helps:
https://mungingdata.com/apache-spark/partitionby/
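
As a rough sketch of what recreating the table with more partitions could look like, the snippet below calls `repartition(n)` before the Parquet write so that the output is spread over n files. This is only one possible approach and an assumption on my part; the linked page covers `partitionBy`, which splits the output by column value instead. The names `RepartitionSketch` and `rewriteWithPartitions`, the temporary output path, and the `spark.range` stand-in data are all hypothetical, so substitute your own table and location.

```scala
import java.io.File
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}

object RepartitionSketch {

  // Hypothetical helper: rewrites `df` as Parquet under `path` using
  // `numPartitions` write tasks and returns the number of part files produced.
  def rewriteWithPartitions(df: DataFrame, path: String, numPartitions: Int): Int = {
    df.repartition(numPartitions) // round-robin shuffle into numPartitions partitions
      .write
      .mode("overwrite")
      .parquet(path)

    // Count the Parquet part files that were written (local filesystem only).
    Option(new File(path).listFiles())
      .map(_.count(f => f.getName.startsWith("part-") && f.getName.endsWith(".parquet")))
      .getOrElse(0)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the session usually exists already.
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("repartition-sketch")
      .getOrCreate()

    // Stand-in for the real source table (assumption: any DataFrame works here).
    val source = spark.range(0, 100000).toDF("id")

    // Hypothetical output location in a fresh temporary directory.
    val outDir = Files.createTempDirectory("repartition-sketch").resolve("table").toString

    val files = rewriteWithPartitions(source, outDir, 16)
    println(s"part files written: $files")

    spark.stop()
  }
}
```

Whether more files actually translate into more scan tasks also depends on file sizes and on settings such as `spark.sql.files.maxPartitionBytes`, so treat the partition number as something to tune rather than a fixed recommendation.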

---
Cheers,
-z
________________________________________
From: Yeikel <email@yeikel.com>
Sent: Wednesday, April 22, 2020 12:17
To: user@spark.apache.org
Subject: Re: What is the best way to take the top N entries from a hive table/data source?

Hi Zhang. Thank you for your response.

While your answer clarifies my confusion about `CollectLimit`, it still does
not clarify the recommended way to extract large amounts of data
(but not all the records) from a source while maintaining a high level of
parallelism.

For example, in some cases, when trying to extract 1 million records from a
table with over 100M records, I see my cluster using 1-2 cores out of the
hundreds that I have available.




---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org



