spark-user mailing list archives

From tiandiwoxin1234 <>
Subject Re: Problem using limit clause in spark sql
Date Sat, 26 Dec 2015 08:46:30 GMT
As for 'rdd.zipWithIndex.partitionBy(YourCustomPartitioner)': can I actually drop records inside my custom partitioner? If not, I still have to call rdd.take() to get exactly 10000 records.

And repartition is exactly the expensive operation I want to work around.

Actually, what I expect the limit clause to do is use some kind of coordinator to assign each partition a number of records to keep, where the sum of the assignments is exactly the limit (or fewer). But it seems this cannot be easily done.
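The coordinator idea above can be sketched on the driver side: count the records in each partition with one cheap job, then hand each partition a quota so the quotas sum to exactly the limit, with no shuffle. This is only an illustration of the idea, not Spark's actual limit implementation; `assignQuotas` is a hypothetical helper name.

```scala
// Driver-side "coordinator": turn per-partition record counts into quotas
// that sum to min(limit, total record count).
def assignQuotas(partitionCounts: Array[Long], limit: Long): Array[Long] = {
  var remaining = limit
  partitionCounts.map { c =>
    val quota = math.min(c, remaining) // never more than the partition holds
    remaining -= quota                 // ...or more than is still needed
    quota
  }
}

// Against an RDD the two cluster-side steps would look roughly like this
// (assuming an existing `rdd`; the count step is one extra job, no shuffle):
//   val counts = rdd.mapPartitionsWithIndex((i, it) => Iterator((i, it.size.toLong)))
//                   .collect().sortBy(_._1).map(_._2)
//   val quotas = assignQuotas(counts, 10000L)
//   val limited = rdd.mapPartitionsWithIndex((i, it) => it.take(quotas(i).toInt))
```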

> On 25 Dec 2015, at 11:10 PM, manasdebashiskar [via Apache Spark User List] <> wrote:
> It can be easily done using an RDD. 
> rdd.zipWithIndex.partitionBy(YourCustomPartitioner) should give you your items. 
> Here YourCustomPartitioner will know how to pick sample items from each partition. 
> If you want to stick to DataFrames, you can always repartition the data after you apply the limit. 
> ..Manas 
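For what it's worth, here is a minimal runnable sketch of the zipWithIndex route suggested in the reply, simulated on a plain Seq so it needs no cluster. One caveat it illustrates: a Partitioner only decides *where* a record goes, never whether it is kept, so an explicit filter on the attached index is still what produces exactly N records. All names here are illustrative.

```scala
val limit = 4L
val records = Seq("a", "b", "c", "d", "e", "f")

// zipWithIndex analogue: attach a stable global index to every record.
val indexed: Seq[(Long, String)] = records.zipWithIndex.map { case (r, i) => (i.toLong, r) }

// What a custom Partitioner's getPartition could do: route records under the
// limit to partition 0 and everything else to partition 1.
def getPartition(index: Long): Int = if (index < limit) 0 else 1

// Routing alone keeps all records (both groups are non-empty)...
val routed = indexed.groupBy { case (i, _) => getPartition(i) }

// ...so the take-N effect actually comes from filtering on the index.
val firstN = indexed.filter { case (i, _) => i < limit }.map(_._2)
// firstN == Seq("a", "b", "c", "d")
```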
