spark-user mailing list archives

From ZHANG Wei <>
Subject Re: What is the best way to take the top N entries from a hive table/data source?
Date Tue, 21 Apr 2020 09:46:14 GMT

[...] may explain the question as below:

> This patch preserves this optimization by treating logical Limit operators specially
> when they appear as the terminal operator in a query plan: if a Limit is the final operator,
> then we will plan a special CollectLimit physical operator which implements the old take()-based

For `spark.sql("select * from db.table limit 1000000").explain(false)`, `limit` is the final
operator; for `spark.sql("select * from db.table limit 1000000").repartition(1000).explain(false)`,
`repartition` is the final operator. If you add a `.limit()` operation after `repartition`,
such as `spark.sql("select * from db.table limit 1000000").repartition(1000).limit(1000).explain(false)`,
`CollectLimit` will show up again.
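
The three cases can be compared side by side with a rough sketch like the one below. It is only an illustration: the local session and the synthetic view `t` are assumptions standing in for `db.table`, and the exact plan text depends on the Spark version (newer releases may wrap the plan in an adaptive execution node), so no expected output is shown.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("limit-plans")
  .getOrCreate()

// Assumption: a synthetic view named `t` stands in for db.table.
spark.range(0L, 10000000L).createOrReplaceTempView("t")

val limited = spark.sql("select * from t limit 1000000")

// Case 1: limit is the final operator of the query.
limited.explain(false)

// Case 2: repartition is the final operator; the limit sits below it in the plan.
limited.repartition(1000).explain(false)

// Case 3: a second limit after repartition makes limit the final operator again.
limited.repartition(1000).limit(1000).explain(false)
```

Comparing the root node of each printed plan should show which operator Spark treated as terminal.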


From: Yeikel <>
Sent: Wednesday, April 15, 2020 2:45
Subject: Re: What is the best way to take the top N entries from a hive table/data source?

Looking at the results of explain, I can see a CollectLimit step. Does that
work the same way as a regular .collect() (where all records are sent to
the driver)?

spark.sql("select * from db.table limit 1000000").explain(false)
== Physical Plan ==
CollectLimit 1000000
+- FileScan parquet ... 806 more fields] Batched: false, Format: Parquet,
Location: CatalogFileIndex[...], PartitionCount: 3, PartitionFilters: [],
PushedFilters: [], ReadSchema:.....
db: Unit = ()

The number of partitions is 1 so that makes sense.

spark.sql("select * from db.table limit 1000000").rdd.partitions.size = 1

As a follow-up, I tried to repartition the resultant dataframe, and while I
can't see the CollectLimit step anymore, it did not make any difference in
the job. I still saw a big task at the end that ends up failing.

spark.sql("select * from db.table limit

Exchange RoundRobinPartitioning(1000)
+- GlobalLimit 1000000
   +- Exchange SinglePartition
      +- LocalLimit 1000000  -> Is this a collect?
